I have a string of text like this:
s = 'hi, welcome to grade 3'
Currently, when I tokenize the string, I get:
tokens = ['hi', 'welcome', 'to', 'grade', '3']
How can I tokenize the string without generating separate tokens for common phrases like 'grade 3'?
I would like the output to be something like:
tokens = ['hi', 'welcome', 'to', 'grade 3']
I have a list of common phrases that I want to keep as single tokens, if that makes it simpler.
I don't want to turn all of my tokens into bigrams, since I still need the single-word tokens for other parts of the program.
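This code uses NLTK's MWETokenizer (multi-word expression tokenizer), which merges the phrases you register into single tokens and leaves all other tokens unchanged.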
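from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# Register each phrase as a tuple of its tokens. separator=' ' joins the
# merged tokens with a space instead of the default '_', so tokens that
# already contain an underscore are not altered afterwards.
tk = MWETokenizer([('grade', '3')], separator=' ')
# Note: word_tokenize also emits punctuation such as ',' as its own token.
words = tk.tokenize(word_tokenize('hi, welcome to grade 3'))
print(words)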
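
If you would rather not depend on NLTK, the same idea can be sketched by hand. This is only an illustration, not how MWETokenizer is implemented: it does a greedy longest-match merge over an already tokenized list. The function name merge_phrases, the common_phrases list, and the regex-based word splitting are all assumptions made for this example.

```python
import re

def merge_phrases(tokens, phrases):
    """Greedily merge runs of tokens that match a known phrase (longest match first)."""
    # Store each phrase as a tuple of lowercased words for case-insensitive lookup.
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible window first, shrinking down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in phrase_set:
                merged.append(' '.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token as-is.
            merged.append(tokens[i])
            i += 1
    return merged

common_phrases = ['grade 3', 'new york city']  # hypothetical phrase list
s = 'hi, welcome to grade 3'
tokens = re.findall(r'\w+', s)  # simple word tokenizer that drops punctuation
print(merge_phrases(tokens, common_phrases))
```

Because only the listed phrases are merged, every other word stays a single token, so the rest of the program can keep working with unigrams.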