Tags: nltk, tokenize, named-entity-recognition

NLTK tokenize but don't split named entities


I am working on a simple grammar-based parser. For this I need to first tokenize the input. In my texts, lots of cities appear (e.g., New York, San Francisco, etc.). When I just use the standard NLTK word_tokenize, all of these cities are split.

from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')

Current output:

['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']

Desired output:

['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']

How can I tokenize such sentences without splitting named entities?


Solution

  • Identify the named entities, then walk the result and join the chunked tokens together:

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> toks = word_tokenize('What are we going to do in San Francisco?')
    >>> chunks = ne_chunk(pos_tag(toks))
    >>> [w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks]
    ['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
    

    Each element of chunks is either a plain (word, pos) tuple or a Tree whose leaves are the (word, pos) pairs that make up the named entity.
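
    If you need this as a reusable step in your parser, the snippet below wraps the same idea into a helper function. It is a minimal sketch: the name tokenize_keep_entities is made up for illustration, and it assumes the NLTK resources that pos_tag and ne_chunk rely on (the Punkt tokenizer, the perceptron tagger, the NE chunker, and the words corpus) have already been fetched with nltk.download(); the exact package names vary between NLTK versions.

    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    def tokenize_keep_entities(text):
        """Tokenize text, keeping multi-word named entities as single tokens."""
        chunks = ne_chunk(pos_tag(word_tokenize(text)))
        tokens = []
        for chunk in chunks:
            if isinstance(chunk, Tree):
                # Named-entity subtree: join its word leaves back into one token.
                tokens.append(" ".join(word for word, tag in chunk.leaves()))
            else:
                # Plain (word, pos) pair outside any entity.
                tokens.append(chunk[0])
        return tokens

    print(tokenize_keep_entities('What are we going to do in San Francisco?'))
    # Expected (matching the output above):
    # ['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']

    The loop does exactly what the one-liner above does, just spelled out step by step so it is easier to extend (for example, to keep the entity label from the Tree node as well).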