Named Entity Recognition
Named Entity Recognition (NER) is a natural language processing task that identifies and categorizes named entities in text into predefined categories such as persons, organizations, locations, and dates. It helps extract structured information from unstructured text data.
In simpler terms, NER is like a highlighter for important words in a text: it can pick out the names of people, places, and organizations mentioned in a news article or a social media post.
In summary, NER automatically identifies and classifies specific entities in text, making large amounts of textual data easier to understand and analyze.
For example: "Birendra lives in Kathmandu." In this sentence, Birendra should be categorized as a person and Kathmandu as a location.
Steps:
- Tokenize the sentence and add [CLS] at the beginning and [SEP] at the end.
- Feed the input tokens to a pre-trained BERT model and obtain a representation for each token.
- Feed each token representation to a classifier (a feed-forward neural network followed by a softmax) to predict an entity tag per token, as sketched in code right after this list.
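Before turning to the convenience pipeline, here is a minimal sketch of those three steps done by hand, assuming the same dslim/bert-base-NER checkpoint used in the full example below (the sentence and the print formatting are only illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Same checkpoint as the pipeline example that follows.
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

sentence = "Birendra lives in Kathmandu"

# Step 1: tokenize; the tokenizer inserts [CLS] and [SEP] for us.
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Steps 2-3: BERT yields one vector per token; the classification head maps
# each vector to per-label logits, and argmax (equivalent to argmax over the
# softmax) picks the predicted tag for every token.
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = logits.argmax(dim=-1)[0]

for token, label_id in zip(tokens, predicted_ids):
    print(f"{token}: {model.config.id2label[label_id.item()]}")

The pipeline API used next wraps this same tokenize-encode-classify loop and adds convenient post-processing.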
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

example = "I am Birendra, King of Nepal and I Live in Kathmandu."

# Load the tokenizer and the BERT model fine-tuned for NER.
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
(Loading the checkpoint prints a warning that the unused pooler weights, 'bert.pooler.dense.weight' and 'bert.pooler.dense.bias', were skipped. This is expected: token classification does not use BERT's pooler.)
# Build a token-classification (NER) pipeline and tag the example sentence.
ner = pipeline("ner", model=model, tokenizer=tokenizer)
for item in ner(example):
    print(f'{item["word"]} [{item["entity"]}] Score: {item["score"]}')
B [B-PER] Score: 0.9995802044868469
##ire [B-PER] Score: 0.9978145360946655
##ndra [B-PER] Score: 0.49515867233276367
Nepal [B-LOC] Score: 0.9997472167015076
Kat [B-LOC] Score: 0.9995753169059753
##hman [I-LOC] Score: 0.9938346147537231
##du [I-LOC] Score: 0.995941698551178
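The ## prefixes are WordPiece subword markers, and the labels follow the BIO scheme: B- marks the first piece of an entity, I- a continuation of it. If you want whole words back instead of subword pieces, the pipeline accepts an aggregation_strategy argument. A minimal sketch (aggregated items use the "entity_group" key instead of "entity", and their scores are averaged, so they will differ from the per-token values above):

# Re-run with subword aggregation so "Birendra" and "Kathmandu" come back as whole words.
ner_agg = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for item in ner_agg(example):
    print(f'{item["word"]} [{item["entity_group"]}] Score: {item["score"]}')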