machine-learning · scikit-learn · nlp · lemmatization · tfidfvectorizer

What is the use case of tokenization and lemmatization in NLP when we have CountVectorizer and TfidfVectorizer?


I am learning NLP and have gone through tokenization, lemmatization, parts of speech, and other basics. I came to know that scikit-learn provides CountVectorizer and TfidfVectorizer, which have an internal ability to apply tokenization and lemmatization.

So my question is:

When do I need to use the core NLP steps to build the vocabulary myself, instead of relying on CountVectorizer and TfidfVectorizer?


Solution

  • Tokenization and lemmatization are basic building blocks in NLP. Tokenization breaks a string into tokens/words, and the right way to do it depends on the language of the text, how the text is formed, and so on. For example, tokenizing Chinese text is different from tokenizing English text, which is in turn different from tokenizing a tweet. So there exist different kinds of tokenizers (see the first sketch after this answer).

    CountVectorizer and TfidfVectorizer vectorize a block of text based on the words within it. So they need a mechanism to tokenize the text, and they let us supply our own tokenizer (as a callable passed in as an argument). If we don't pass one in, they fall back to a simple regex-based tokenizer (the default token_pattern, which picks out tokens of two or more alphanumeric characters).

    See the docs of CountVectorizer:

    tokenizer: callable, default=None

    Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

    So they allow us to pass in our own tokenizer. The same applies to lemmatization: there is no built-in lemmatizer, but we can lemmatize inside the custom tokenizer we pass in (see the second sketch below).
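
To make the first point concrete, here is a minimal sketch (assuming NLTK is installed and its punkt tokenizer data has been downloaded) comparing a general-purpose tokenizer with a tweet-aware one on the same text; the example string is made up for illustration.

```python
# Minimal sketch, assuming NLTK is installed and nltk.download("punkt")
# has been run: the same text tokenized two different ways.
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet = "@user loving #NLP :) can't wait!!"  # made-up example text

# General-purpose tokenizer: splits '@user', '#NLP' and ':)' apart.
print(word_tokenize(tweet))

# Tweet-aware tokenizer: keeps mentions, hashtags and emoticons intact.
print(TweetTokenizer().tokenize(tweet))
```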
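And here is a sketch of the second point: a custom callable that both tokenizes and lemmatizes, passed to CountVectorizer through its tokenizer argument. This assumes scikit-learn and NLTK (with the punkt and wordnet data) are installed; the LemmaTokenizer class and the example documents are our own illustrations, not part of either library.

```python
# Minimal sketch: plugging lemmatization into CountVectorizer via a
# custom tokenizer callable. LemmaTokenizer is our own helper class.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer:
    """Callable that tokenizes a document and lemmatizes each token."""
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(tok) for tok in word_tokenize(doc)]

docs = ["The cats are sitting on the mats",  # made-up example documents
        "A cat sat on a mat"]

vec = CountVectorizer(tokenizer=LemmaTokenizer())
X = vec.fit_transform(docs)

# 'cats'/'cat' and 'mats'/'mat' each collapse into a single vocabulary
# entry, which the default regex tokenizer would keep separate.
print(sorted(vec.get_feature_names_out()))
```

With the custom tokenizer, cat/cats and mat/mats each become one feature instead of two, which is exactly the kind of vocabulary control the question asks about.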