Tags: python-3.x, nltk, tokenize

NLTK word_tokenize splitting a word (slang) on its own


  import nltk

  word_data = 'Why you gotta be so rude'
  nltk_tokens = nltk.word_tokenize(word_data)
  print(nltk_tokens)

OUTPUT: ['Why', 'you', 'got', 'ta', 'be', 'so', 'rude']

Can someone explain why "gotta" got split into "got" and "ta"?


Solution

  • I suspect it's treating it like a contraction; my understanding is that it would split "you're" into the tokens "you" and "'re".

    If you don't want this pseudo-word split, you may be able to use wordpunct_tokenize, which according to the docs uses a simpler tokenizing algorithm that just splits on whitespace and punctuation (see the sketch below).
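
    As a minimal sketch (assuming the punkt tokenizer data has already been downloaded, e.g. via nltk.download('punkt')), the two tokenizers can be compared side by side:

      import nltk
      from nltk.tokenize import wordpunct_tokenize

      sentence = 'Why you gotta be so rude'

      # word_tokenize applies Treebank-style rules, which split
      # contraction-like forms such as "gotta" into "got" + "ta"
      print(nltk.word_tokenize(sentence))
      # ['Why', 'you', 'got', 'ta', 'be', 'so', 'rude']

      # wordpunct_tokenize only splits on whitespace and punctuation,
      # so "gotta" survives as a single token
      print(wordpunct_tokenize(sentence))
      # ['Why', 'you', 'gotta', 'be', 'so', 'rude']

    The trade-off is that wordpunct_tokenize handles real contractions differently too: "you're" would come out as ['you', "'", 're'] rather than word_tokenize's ['you', "'re"].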