from nltk.tokenize import RegexpTokenizer
text = "That's some text, you know!"
tokens = []
tokenizer = RegexpTokenizer(r'\w+')
tokens += tokenizer.tokenize(text.lower())
Currently returns: tokens = ['that', 's', 'some', 'text', 'you', 'know']
I need it to return: tokens = ['thats', 'some', 'text', 'you', 'know']
("thats" should be a single token.)
There are two solutions. Either strip the apostrophes from your text variable before tokenizing:
text = text.replace("'", "")
or allow apostrophes inside a token, so that "that's" is matched as a single word (note that the apostrophe is kept, so the token is "that's" rather than "thats"):
tokenizer = RegexpTokenizer(r'[\w\']+')
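To compare the two approaches side by side, here is a minimal sketch. It uses re.findall from the standard library as a stand-in for RegexpTokenizer, on the assumption that the tokenizer in its default (non-gaps) mode returns every match of the pattern; the variable names are only illustrative.

```python
import re

text = "That's some text, you know!"

# Option 1: remove apostrophes first, then match runs of word characters.
stripped = text.lower().replace("'", "")
tokens_stripped = re.findall(r"\w+", stripped)

# Option 2: let apostrophes be part of a token, so "that's" stays whole.
tokens_kept = re.findall(r"[\w']+", text.lower())

# Optional follow-up to option 2: drop the apostrophes afterwards,
# skipping any token that consisted only of apostrophes.
tokens_joined = [t.replace("'", "") for t in tokens_kept if t.strip("'")]

print(tokens_stripped)  # ['thats', 'some', 'text', 'you', 'know']
print(tokens_kept)      # ["that's", 'some', 'text', 'you', 'know']
print(tokens_joined)    # ['thats', 'some', 'text', 'you', 'know']
```

If the goal is exactly 'thats', option 1 (or option 2 followed by the clean-up step) gives it; option 2 on its own is the better fit when the apostrophe should be preserved.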