import nltk

word_data = 'Why you gotta be so rude'
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
OUTPUT: ['Why', 'you', 'got', 'ta', 'be', 'so', 'rude']
Can someone explain why "gotta" got split into "got" and "ta"?
I suspect it's treating it like a contraction; e.g., my understanding is that it would split "you're" into the tokens "you" and "'re".
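A quick check seems to bear this out (a minimal sketch; the punkt tokenizer models must be downloaded first, and exact output could vary by NLTK version):

import nltk
# nltk.download('punkt')  # uncomment if the tokenizer models aren't installed yet

# word_tokenize follows Penn Treebank conventions, which split contractions
print(nltk.word_tokenize("you're so rude"))
# expected: ['you', "'re", 'so', 'rude']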
If you don't want it to split this pseudo-word, you may be able to use wordpunct_tokenize, which according to the docs uses a simpler regular-expression tokenizer that just splits around whitespace and punctuation, so an all-alphabetic token like "gotta" stays intact.
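For example (again hedging on the exact NLTK version, though this tokenizer is a fixed regex, \w+|[^\w\s]+, and should be stable):

import nltk

# wordpunct_tokenize groups runs of word characters and runs of punctuation,
# so the purely alphabetic 'gotta' is kept as a single token
print(nltk.wordpunct_tokenize('Why you gotta be so rude'))
# expected: ['Why', 'you', 'gotta', 'be', 'so', 'rude']

The trade-off is that it handles real contractions differently too: "you're" would come out as ['you', "'", 're'] rather than ['you', "'re"].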