machine-learning · scikit-learn · nlp · lemmatization · tfidfvectorizer

What is the use case of tokenization and lemmatization in NLP when we have CountVectorizer and TfidfVectorizer?


I am learning NLP and have gone through tokenization, lemmatization, parts of speech, and other basics. I came to know that scikit-learn provides CountVectorizer and TfidfVectorizer, which have an internal ability to apply tokenization and lemmatization.

So my question is:

When do I need to use the core NLP steps to build the vocabulary myself, instead of relying on CountVectorizer and TfidfVectorizer?


Solution

  • Tokenization and lemmatization are basic building blocks in NLP. Tokenization breaks a string into tokens/words, and the right way to do it depends on the language of the text, how the text is formed, and so on. For example, tokenizing Chinese text is different from tokenizing English text, which is in turn different from tokenizing a tweet. So there exist different kinds of tokenizers (see the first sketch after this answer).

    CountVectorizer and TfidfVectorizer vectorize a block of text based on the words within it. So they need a mechanism to tokenize the text, and they let us supply our own tokenizer (as a callable passed in as an argument). If we don't pass one in, they fall back to a simple regex-based tokenizer (the default token_pattern, which picks out tokens of two or more alphanumeric characters).

    See the docs of CountVectorizer:

    tokenizer: callable, default=None

    Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

    So they allow us to pass in our own tokenizer. The same applies to lemmatization: there is no built-in lemmatizer, but we can lemmatize inside the custom tokenizer we pass in (see the second sketch below).
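
To make the first point concrete, here is a minimal sketch (assuming NLTK is installed and its punkt tokenizer data has been downloaded) comparing a general-purpose tokenizer with a tweet-aware one on the same text; the example string is made up for illustration.

```python
# Minimal sketch, assuming NLTK is installed and nltk.download("punkt")
# has been run: the same text tokenized two different ways.
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet = "@user loving #NLP :) can't wait!!"  # made-up example text

# General-purpose tokenizer: splits '@user', '#NLP' and ':)' apart.
print(word_tokenize(tweet))

# Tweet-aware tokenizer: keeps mentions, hashtags and emoticons intact.
print(TweetTokenizer().tokenize(tweet))
```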
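And here is a sketch of the second point: a custom callable that both tokenizes and lemmatizes, passed to CountVectorizer through its tokenizer argument. This assumes scikit-learn and NLTK (with the punkt and wordnet data) are installed; the LemmaTokenizer class and the example documents are our own illustrations, not part of either library.

```python
# Minimal sketch: plugging lemmatization into CountVectorizer via a
# custom tokenizer callable. LemmaTokenizer is our own helper class.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer:
    """Callable that tokenizes a document and lemmatizes each token."""
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(tok) for tok in word_tokenize(doc)]

docs = ["The cats are sitting on the mats",  # made-up example documents
        "A cat sat on a mat"]

vec = CountVectorizer(tokenizer=LemmaTokenizer())
X = vec.fit_transform(docs)

# 'cats'/'cat' and 'mats'/'mat' each collapse into a single vocabulary
# entry, which the default regex tokenizer would keep separate.
print(sorted(vec.get_feature_names_out()))
```

With the custom tokenizer, cat/cats and mat/mats each become one feature instead of two, which is exactly the kind of vocabulary control the question asks about.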