Recently I started reading about NLP and following Python tutorials to learn more about the subject. In one of the tutorials I noticed that they use the sparse matrix of word counts per tweet (created with CountVectorizer) as input to TfidfTransformer, which transforms the data and passes it on to the classifier for training and prediction:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vect', CountVectorizer()),    # raw text -> sparse matrix of word counts
    ('tfidf', TfidfTransformer()),  # word counts -> tf-idf weights
    ('clf', LogisticRegression())   # classifier trained on the tf-idf features
])
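For reference, this is how I then fit it end to end (the tweets and labels below are made up for illustration):

# Hypothetical toy data, just to show the fit/predict flow
tweets = ["I love this movie", "worst day ever", "such a great game", "I hate mondays"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipeline.fit(tweets, labels)                # runs vect -> tfidf -> clf in order
print(pipeline.predict(["what a great movie"]))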
As no explanation was provided, I can't follow the thought process behind this. Isn't the CountVectorizer output just a regular bag of words? Can't the same thing be done with a single class, for example just the tf-idf one?
Any clarification would be greatly appreciated.
Bag of words is exactly what CountVectorizer does: it builds a vector of word counts for each document (each tweet, in your case).
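A quick illustration with two toy sentences of my own (get_feature_names_out assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vect = CountVectorizer()
counts = vect.fit_transform(docs)      # sparse (2 x vocabulary_size) count matrix
print(vect.get_feature_names_out())    # ['cat' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1]
#  [1 1 1 1 2]]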
TfidfTransformer then takes that BoW matrix and transforms it to, well, tf-idf: term frequency within the document, weighted by inverse document frequency.
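Continuing the same toy example: with scikit-learn's defaults (smooth_idf=True, L2 row normalisation) the idf part is ln((1 + n) / (1 + df(t))) + 1, so words that appear in every document get down-weighted relative to rarer ones:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()             # defaults: use_idf=True, smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(counts)  # reweights the count matrix from above
# 'the' appears in both docs, so its idf is the minimum (1.0);
# 'mat' and 'on' appear in only one doc, so they get idf ~1.41
print(weights.toarray().round(2))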
This part of the pipeline can be replaced with TfidfVectorizer, which is literally BoW + tf-idf in one class. The latter is rarely used without the former, so the combined version makes sense if a classifier is all you need at the end of the day.
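In other words, these two pipelines are equivalent:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TfidfVectorizer == CountVectorizer followed by TfidfTransformer, in one step
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])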