python nltk tweepy analysis

Why tokenize/preprocess words for language analysis?


I am currently working on a Python tweet analyser and part of this will be to count common words. I have seen a number of tutorials on how to do this, and most tokenize the strings of text before further analysis.

Surely it would be easier to avoid this stage of preprocessing and count the words directly from the string - so why do this?


Solution

  • Perhaps I'm being overly pedantic, but doesn't tokenization simply mean splitting the input stream (of characters, in this case) on delimiters to obtain whatever you regard as a "token"?

    Your tokens can be arbitrary: you can analyse at the word level, where each token is a word and the delimiter is any whitespace or punctuation character. You might just as well analyse n-grams, where each token is a group of n consecutive words, produced e.g. by sliding a window over the word sequence.

    So in short, in order to analyse words in a stream of text, you need to tokenize it to obtain the "raw" words to operate on.

    Tokenization, however, is often followed by stemming or lemmatization to reduce noise. This becomes quite clear when thinking about sentiment analysis: if you see the tokens happy, happily and happiness, do you want to treat each of them separately, or would you rather combine them into three instances of happy to convey a stronger notion of "being happy"?
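To make the points above concrete, here is a minimal sketch using only the Python standard library. The sample tweets, the regular expression, and the `LEMMAS` lookup are all assumptions made up for illustration; in particular, `LEMMAS` is a hand-written stand-in for a real stemmer or lemmatizer (NLTK provides both), not how such a tool works internally.

```python
import re
from collections import Counter

# Hypothetical sample tweets, invented for this example.
tweets = [
    "Happy days! So happy today :)",
    "Happily coding... happiness is #python",
]

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes.

    Punctuation, emoticons, and the '#' of a hashtag are dropped.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Slide a window of size n over the tokens and return each window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical lookup table standing in for a stemmer or lemmatizer.
LEMMAS = {"happily": "happy", "happiness": "happy"}

def normalize(tokens):
    """Map each token to its base form if the table knows one."""
    return [LEMMAS.get(token, token) for token in tokens]

# Counting "directly from the string": split on whitespace only.
naive_counts = Counter(" ".join(tweets).split())

# Counting after tokenization, then after normalization.
tokens = [token for tweet in tweets for token in tokenize(tweet)]
raw_counts = Counter(tokens)
normalized_counts = Counter(normalize(tokens))

# Bigrams are built per tweet so no window spans two tweets.
bigrams = [bigram for tweet in tweets for bigram in ngrams(tokenize(tweet), 2)]

print(naive_counts["happy"], raw_counts["happy"], normalized_counts["happy"])
print(bigrams[:3])
```

With whitespace splitting, "Happy" and "happy" are counted as different words, and "days!" and "coding..." keep their punctuation. Tokenizing merges the case variants, and the normalization step additionally folds "happily" and "happiness" into "happy". Whether to strip the "#" from hashtags is a design choice; for tweets you may prefer to keep them as separate tokens.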