Tags: nltk, tokenize, named-entity-recognition

NLTK tokenize but don't split named entities


I am working on a simple grammar-based parser. For this I need to first tokenize the input. In my texts, lots of cities appear (e.g., New York, San Francisco, etc.). When I just use the standard NLTK word_tokenize, all of these cities are split.

from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')

Current output:

['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']

Desired output:

['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']

How can I tokenize such sentences without splitting named entities?


Solution

  • Identify the named entities, then walk the result and join the chunked tokens together:

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> toks = word_tokenize('What are we going to do in San Francisco?')
    >>> chunks = ne_chunk(pos_tag(toks))
    >>> [w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks]
    ['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
    

    Each element of chunks is either a plain (word, pos) tuple or a Tree whose leaves are the (word, pos) pairs that make up the named entity.
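
    If you need this as a reusable step in your parser, the snippet below wraps the same idea into a helper function. It is a minimal sketch: the name tokenize_keep_entities is made up for illustration, and it assumes the NLTK resources that pos_tag and ne_chunk rely on (the Punkt tokenizer, the perceptron tagger, the NE chunker, and the words corpus) have already been fetched with nltk.download(); the exact package names vary between NLTK versions.

    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    def tokenize_keep_entities(text):
        """Tokenize text, keeping multi-word named entities as single tokens."""
        chunks = ne_chunk(pos_tag(word_tokenize(text)))
        tokens = []
        for chunk in chunks:
            if isinstance(chunk, Tree):
                # Named-entity subtree: join its word leaves back into one token.
                tokens.append(" ".join(word for word, tag in chunk.leaves()))
            else:
                # Plain (word, pos) pair outside any entity.
                tokens.append(chunk[0])
        return tokens

    print(tokenize_keep_entities('What are we going to do in San Francisco?'))
    # Expected (matching the output above):
    # ['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']

    The loop does exactly what the one-liner above does, just spelled out step by step so it is easier to extend (for example, to keep the entity label from the Tree node as well).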