I am working on a simple grammar-based parser, for which I first need to tokenize the input. My texts contain many city names (e.g., New York, San Francisco). When I just use the standard nltk word_tokenize, these city names are split into separate tokens.
from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')
Current output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']
Desired output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
How can I tokenize such sentences without splitting named entities?
Identify the named entities, then walk the result and join the chunked tokens together:
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> toks = word_tokenize('What are we going to do in San Francisco?')
>>> chunks = ne_chunk(pos_tag(toks))
>>> [ w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks ]
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
Each element of chunks is either a (word, pos) tuple or a Tree() containing the parts of the chunk.
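For reuse, the same logic can be wrapped in a small helper function. This is a minimal sketch of the approach shown above; the function name tokenize_keeping_entities is just for illustration. Note that pos_tag and ne_chunk need a few NLTK data packages (e.g., 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words') downloaded via nltk.download if they are missing.

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def tokenize_keeping_entities(text):
    """Tokenize text, merging tokens that ne_chunk groups into a named entity."""
    chunks = ne_chunk(pos_tag(word_tokenize(text)))
    tokens = []
    for chunk in chunks:
        if isinstance(chunk, Tree):
            # A named-entity subtree: join its leaves into a single token.
            tokens.append(" ".join(word for word, tag in chunk.leaves()))
        else:
            # A plain (word, pos) tuple outside any entity.
            tokens.append(chunk[0])
    return tokens

print(tokenize_keeping_entities('What are we going to do in San Francisco?'))
# ['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']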