Tags: python, nlp, nltk, tokenize

Tokenizing without breaking up key phrases


I have a string of text like s = 'hi, welcome to grade 3'.

Currently, when I tokenize the string, I get:

tokens = ['hi', 'welcome', 'to', 'grade', '3']

How can I tokenize the string without generating separate tokens for common phrases like 'grade 3'?

I would want the output to be something like:

tokens = ['hi', 'welcome', 'to', 'grade 3']

I have a list of common phrases I want to keep as single tokens, if that makes it simpler.

Ultimately, I don't want to make all of my tokens bigrams, as I still need the single-word tokens for other parts of the program.


Solution

  • This code uses NLTK's MWETokenizer (multi-word expression tokenizer), which merges specified token sequences back into single tokens after ordinary word tokenization.

    from nltk import word_tokenize
    from nltk.tokenize import MWETokenizer

    # Merge the token pair ('grade', '3') into one token joined by '_'
    tk = MWETokenizer([('grade', '3')])
    # word_tokenize needs the Punkt models: nltk.download('punkt')
    tokens = tk.tokenize(word_tokenize('hi, welcome to grade 3'))
    # Swap the default '_' separator back to a space to get 'grade 3'
    words = [val.replace('_', ' ') for val in tokens]
    print(words)  # ['hi', ',', 'welcome', 'to', 'grade 3']
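
  • Since the question mentions a list of common phrases, here is a minimal sketch extending the same approach. The phrase list below is illustrative; MWETokenizer's separator argument and add_mwe method are part of NLTK's public API.

    from nltk import word_tokenize
    from nltk.tokenize import MWETokenizer

    # Illustrative phrase list; substitute your own common phrases.
    phrases = [('grade', '3'), ('New', 'York'), ('ice', 'cream')]

    # Join matched phrases with a space so no post-processing is needed.
    tk = MWETokenizer(phrases, separator=' ')

    # Phrases can also be registered one at a time.
    tk.add_mwe(('grade', '4'))

    print(tk.tokenize(word_tokenize('hi, welcome to grade 3 in New York')))
    # ['hi', ',', 'welcome', 'to', 'grade 3', 'in', 'New York']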