Tags: python, nlp, nltk, tokenize

Tokenizing without breaking up key phrases


I have a string of text like s = 'hi, welcome to grade 3'.

Currently, when I tokenize the string, I get:

tokens = ['hi', 'welcome', 'to', 'grade', '3']

How can I tokenize the string without generating separate tokens for common phrases like 'grade 3'?

I would want the output to be something like:

tokens = ['hi', 'welcome', 'to', 'grade 3']

I have a list of common phrases I want to keep as single tokens, if that makes it simpler.

Ultimately, I don't want to make all of my tokens bigrams, as I still need the single-word tokens for other parts of the program.


Solution

  • This code uses NLTK's MWETokenizer (multi-word expression tokenizer), which merges specified token sequences back into single tokens after ordinary word tokenization.

    from nltk import word_tokenize
    from nltk.tokenize import MWETokenizer

    # Merge the token pair ('grade', '3') into one token joined by '_'
    tk = MWETokenizer([('grade', '3')])
    # word_tokenize needs the Punkt models: nltk.download('punkt')
    tokens = tk.tokenize(word_tokenize('hi, welcome to grade 3'))
    # Swap the default '_' separator back to a space to get 'grade 3'
    words = [val.replace('_', ' ') for val in tokens]
    print(words)  # ['hi', ',', 'welcome', 'to', 'grade 3']
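
  • Since the question mentions a list of common phrases, here is a minimal sketch extending the same approach. The phrase list below is illustrative; MWETokenizer's separator argument and add_mwe method are part of NLTK's public API.

    from nltk import word_tokenize
    from nltk.tokenize import MWETokenizer

    # Illustrative phrase list; substitute your own common phrases.
    phrases = [('grade', '3'), ('New', 'York'), ('ice', 'cream')]

    # Join matched phrases with a space so no post-processing is needed.
    tk = MWETokenizer(phrases, separator=' ')

    # Phrases can also be registered one at a time.
    tk.add_mwe(('grade', '4'))

    print(tk.tokenize(word_tokenize('hi, welcome to grade 3 in New York')))
    # ['hi', ',', 'welcome', 'to', 'grade 3', 'in', 'New York']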