Tags: tokenize, opennlp

Tokenizing place names like New York


I have been using an NLP tokenizer, but I am not sure about its behavior with place names. If I give it New York or Mexico City, it splits them into New and York, and Mexico and City, respectively.

However, I want New York to stay a single token. Is there a tokenizer that does this, and if not, how can I get this result?

Thanks


Solution

  • Your tokenizer is behaving correctly: New and York are two different tokens. What you want is called chunking, which groups adjacent tokens into larger units such as noun phrases. Here is some information about chunking to give you some background.

    Depending on which NLP library you are using, there is probably some built-in functionality for chunking. For OpenNLP, which you included in your question tags, see this related question: "How to extract the noun phrases using Open nlp's chunking parser"
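
To make the idea of regrouping tokens concrete, here is a minimal sketch in plain Java. It is not OpenNLP's chunker and does not use any OpenNLP API: it swaps in a much simpler technique, a fixed lookup list of place names, and merges adjacent tokens that match an entry. The class name `PlaceMerger`, the `PLACES` list, and the `merge` method are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PlaceMerger {
    // Hypothetical lookup list (assumption): each place name is stored as its tokens.
    static final List<List<String>> PLACES = Arrays.asList(
        Arrays.asList("New", "York"),
        Arrays.asList("Mexico", "City"));

    // Merges any run of adjacent tokens that matches a lookup entry into one token.
    // When several entries match at the same position, the longest one wins.
    static List<String> merge(String[] tokens) {
        List<String> all = Arrays.asList(tokens);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            List<String> match = null;
            for (List<String> place : PLACES) {
                int end = i + place.size();
                if (end <= tokens.length
                        && all.subList(i, end).equals(place)
                        && (match == null || place.size() > match.size())) {
                    match = place;
                }
            }
            if (match != null) {
                out.add(String.join(" ", match)); // e.g. "New" + "York" -> "New York"
                i += match.size();
            } else {
                out.add(tokens[i]);               // ordinary token, kept as-is
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Whitespace splitting stands in for the real tokenizer here.
        String[] tokens = "I flew from New York to Mexico City".split("\\s+");
        System.out.println(merge(tokens));
        // prints [I, flew, from, New York, to, Mexico City]
    }
}
```

A fixed list only recognizes the names you put in it, so this is best seen as a post-processing step for a small, known set of places. For open-ended text, the library's own chunking support mentioned above is the more general route.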