Simple NER - IndexError: string index out of range

Here is a simple example of Named Entity Recognition (NER) using the Natural Language Toolkit (nltk) library in Python:

import nltk

# Input text
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform named entity recognition
entities = nltk.ne_chunk(tokens)

# Print the named entities
print(entities)

When I run this code in a Jupyter Notebook, I get this error:

"IndexError: string index out of range"

Am I missing any installation? Please advise.

Expected output:

(PERSON Barack/NNP Obama/NNP) (GPE Hawaii/NNP) (ORGANIZATION United/NNP States/NNPS)

Solution

nltk.ne_chunk expects POS-tagged tokens, i.e. (word, tag) pairs, rather than plain token strings, so add a tagging step with nltk.pos_tag between tokenization and chunking. The chunker presumably indexes each token as a pair, which would explain why a one-character token such as "." raises the IndexError. Note that ne_chunk still returns every token, with any detected entities grouped into subtrees. Since you want only the entities, keep just the chunks that are nltk.Tree instances, like this:

import nltk

text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
tokens = nltk.word_tokenize(text)
# ne_chunk needs (word, tag) pairs, so POS-tag the tokens first
tagged_tokens = nltk.pos_tag(tokens)
# Keep only the subtrees, which are the detected entities
entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]

for entity in entities:
    print(entity)

Please note that this code doesn't give exactly the output you want. Instead it gives:

(PERSON Barack/NNP)
(PERSON Obama/NNP)
(GPE Hawaii/NNP)
(GPE United/NNP States/NNPS)
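
If you would rather work with plain (label, text) pairs than with printed trees, something along these lines should work. Treat it as a sketch: entity_pairs is a helper name I made up, and the sample tree is built by hand to mirror part of the output above so that it runs without downloading the NLTK models. With real data you would pass it the result of nltk.ne_chunk(tagged_tokens) instead.

```python
import nltk


def entity_pairs(chunks):
    """Return (label, text) pairs for every entity subtree in a chunked sentence."""
    pairs = []
    for chunk in chunks:
        # Non-entity tokens are plain (word, tag) tuples; entities are subtrees
        if isinstance(chunk, nltk.Tree):
            # leaves() yields the (word, tag) tuples under this entity node
            text = " ".join(word for word, _tag in chunk.leaves())
            pairs.append((chunk.label(), text))
    return pairs


# Hand-built stand-in for an ne_chunk result (an assumption, not real model output)
sample = nltk.Tree("S", [
    nltk.Tree("PERSON", [("Barack", "NNP")]),
    nltk.Tree("PERSON", [("Obama", "NNP")]),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    nltk.Tree("GPE", [("Hawaii", "NNP")]),
])

pairs = entity_pairs(sample)
print(pairs)
```

Each pair keeps the entity label next to the joined words, which is usually easier to filter or count than the tree's string form.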