
How to specify additional tokens for the tokenizer?


I want to tokenize text with gensim.utils.tokenize(), and I want certain phrases to be recognized as single tokens, for example 'New York' and 'Long Island'.

Is this possible with gensim? If not, what other libraries could I use?


Solution

  • I've found a solution with NLTK's MWETokenizer, which merges predefined multi-word expressions into single tokens:

    from nltk.tokenize import MWETokenizer

    tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator=' ')
    tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
    # ['An', "hors d'oeuvre", 'tonight,', 'sir?']
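For the phrases from the question, the same idea also works without NLTK. Below is a minimal sketch of the greedy merge that MWETokenizer performs; the `merge_mwes` helper is hypothetical, not a library function, and it assumes the input is already split into tokens:

```python
def merge_mwes(tokens, mwes, separator=' '):
    """Greedily merge known multi-word expressions into single tokens.

    tokens: list of single-word tokens.
    mwes: iterable of tuples, each a multi-word expression to merge.
    """
    mwes = [tuple(m) for m in mwes]
    out = []
    i = 0
    while i < len(tokens):
        # Check whether any known expression starts at position i.
        match = next((m for m in mwes
                      if tuple(tokens[i:i + len(m)]) == m), None)
        if match:
            out.append(separator.join(match))
            i += len(match)
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "I live in New York near Long Island".split()
print(merge_mwes(tokens, [('New', 'York'), ('Long', 'Island')]))
# ['I', 'live', 'in', 'New York', 'near', 'Long Island']
```

In practice you could feed the output of gensim's tokenizer (or any word tokenizer) into this merge step, which is essentially what chaining gensim.utils.tokenize() with NLTK's MWETokenizer would do.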