Here is a simple example of Named Entity Recognition (NER) using the Natural Language Toolkit (NLTK) library in Python:
import nltk
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
tokens = nltk.word_tokenize(text)
entities = nltk.ne_chunk(tokens)
print(entities)
When I run this code in my Jupyter Notebook, I get this error:
"IndexError: string index out of range"
Am I missing any installation? Please advise.
Expected output:
(PERSON Barack/NNP Obama/NNP) (GPE Hawaii/NNP) (ORGANIZATION United/NNP States/NNPS)
nltk.ne_chunk expects its input to be POS-tagged tokens rather than plain tokens, so I would recommend adding a tagging step with nltk.pos_tag between tokenization and NE chunking. NE chunking still returns every token, with the tokens of each detected entity grouped into a subtree. Since you want only the entities, you can keep just the chunks that are nltk.Tree instances, like the following:
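import nltk

text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
tokens = nltk.word_tokenize(text)
# ne_chunk needs (word, tag) pairs, so POS-tag the tokens first
tagged_tokens = nltk.pos_tag(tokens)
# Detected entities are subtrees; everything else is a plain (word, tag) tuple
entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]
for entity in entities:
    print(entity)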
Please note that this code doesn't give exactly the output you want. Instead it gives:
(PERSON Barack/NNP)
(PERSON Obama/NNP)
(GPE Hawaii/NNP)
(GPE United/NNP States/NNPS)
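If you want one entry per entity, closer to the expected output in the question, one option is to join neighbouring chunks that carry the same label. The sketch below is a heuristic of my own, not something NLTK provides: merge_entities and the FakeTree stand-in are hypothetical names, and it assumes that two subtrees sitting directly next to each other with the same label belong to one entity. That assumption can wrongly merge two distinct entities that happen to be adjacent, and it does not change the labels the chunker assigned.

```python
from typing import Iterable, List, Tuple


def merge_entities(chunks: Iterable) -> List[Tuple[str, str]]:
    """Collect (label, text) pairs, joining adjacent chunks that share a label."""
    merged: List[Tuple[str, str]] = []
    previous_was_entity = False
    for chunk in chunks:
        # Entity subtrees expose label() and leaves(); plain (word, tag) tuples do not.
        if hasattr(chunk, "label") and hasattr(chunk, "leaves"):
            label = chunk.label()
            words = " ".join(word for word, _tag in chunk.leaves())
            if previous_was_entity and merged[-1][0] == label:
                # Directly follows an entity with the same label: extend it.
                merged[-1] = (label, merged[-1][1] + " " + words)
            else:
                merged.append((label, words))
            previous_was_entity = True
        else:
            previous_was_entity = False
    return merged


class FakeTree:
    """Hypothetical stand-in for nltk.Tree so this sketch runs without NLTK models."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def leaves(self):
        return self._leaves


# Hand-built chunks mirroring the output shown above (not real chunker output).
chunks = [
    FakeTree("PERSON", [("Barack", "NNP")]),
    FakeTree("PERSON", [("Obama", "NNP")]),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    FakeTree("GPE", [("Hawaii", "NNP")]),
    (".", "."),
    ("the", "DT"),
    FakeTree("GPE", [("United", "NNP"), ("States", "NNPS")]),
    (".", "."),
]

print(merge_entities(chunks))
# [('PERSON', 'Barack Obama'), ('GPE', 'Hawaii'), ('GPE', 'United States')]
```

With NLTK itself you would pass the chunker result directly, for example merge_entities(nltk.ne_chunk(tagged_tokens)), since iterating over the returned tree yields the same mix of subtrees and (word, tag) tuples.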