Simple NER - IndexError: string index out of range error


Here is a simple Named Entity Recognition (NER) example using the Natural Language Toolkit (nltk) library in Python:

    import nltk

    # Input text
    text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Perform named entity recognition
    entities = nltk.ne_chunk(tokens)

    # Print the named entities
    print(entities)

When I run this code in my Jupyter Notebook, I get this error.

"IndexError: string index out of range"

Am I missing an installation step? Please advise.

Expected output:

(PERSON Barack/NNP Obama/NNP) (GPE Hawaii/NNP) (ORGANIZATION United/NNP States/NNPS)


Solution

  • nltk.ne_chunk expects part-of-speech-tagged tokens, i.e. (word, tag) pairs, rather than plain token strings, so I would recommend adding a tagging step with nltk.pos_tag between tokenization and chunking. nltk.ne_chunk still returns every token, with any detected entities grouped into subtrees. Since you want only the entities, keep only the chunks that are nltk.Tree instances, like the following:

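    import nltk

    text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
    tokens = nltk.word_tokenize(text)
    # ne_chunk needs (word, tag) pairs, so tag the tokens first
    tagged_tokens = nltk.pos_tag(tokens)
    # Entities are nltk.Tree subtrees; all other tokens are plain (word, tag) tuples
    entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]

    for entity in entities:
        print(entity)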
    

    Please note that this code doesn't give exactly the output you want. Instead it gives:

    (PERSON Barack/NNP)
    (PERSON Obama/NNP)
    (GPE Hawaii/NNP)
    (GPE United/NNP States/NNPS)
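
If you want "Barack Obama" as a single PERSON chunk, as in the expected output, one possible post-processing step is to merge entity subtrees that sit directly next to each other and share a label. The sketch below is only that, a sketch: merge_adjacent_entities is a hypothetical helper, not part of NLTK, and the sample tree is built by hand (shortened to the first sentence plus the last entity) so it runs without any downloaded models. It assumes that two neighbouring chunks with the same label really belong to one entity.

```python
from nltk import Tree


def merge_adjacent_entities(chunked):
    """Merge neighbouring entity subtrees that share a label.

    Hypothetical helper (not an NLTK function). `chunked` is any
    iterable that mixes (word, tag) tuples and nltk Tree entities,
    such as the result of nltk.ne_chunk.
    """
    merged = []
    previous_was_entity = False
    for chunk in chunked:
        if isinstance(chunk, Tree):
            if previous_was_entity and merged[-1].label() == chunk.label():
                # Same label and no token in between: extend the last entity
                merged[-1].extend(chunk)
            else:
                # Copy the subtree so the input tree is left untouched
                merged.append(Tree(chunk.label(), list(chunk)))
            previous_was_entity = True
        else:
            # A plain token breaks the run of adjacent entities
            previous_was_entity = False
    return merged


# Hand-built stand-in for the chunker output (assumption: mirrors the
# entities printed above)
sample = Tree("S", [
    Tree("PERSON", [("Barack", "NNP")]),
    Tree("PERSON", [("Obama", "NNP")]),
    ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
    Tree("GPE", [("Hawaii", "NNP")]),
    (".", "."),
    Tree("GPE", [("United", "NNP"), ("States", "NNPS")]),
])

pairs = [
    (entity.label(), " ".join(word for word, _tag in entity.leaves()))
    for entity in merge_adjacent_entities(sample)
]
print(pairs)
```

With real chunker output you would pass nltk.ne_chunk(tagged_tokens) in place of sample. Be aware that this heuristic can over-merge, for example two distinct place names that happen to appear side by side with no token between them.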