python · scikit-learn · pipeline · countvectorizer · tfidfvectorizer

CountVectorizer output that serves as TfidfTransformer input vs. TfidfTransformer()


Recently I started reading more about NLP and following tutorials in Python in order to learn more about the subject. In one of the tutorials I noticed that the sparse matrix of word counts for each tweet (created with CountVectorizer) was used as input to TfidfTransformer, which transforms the data and passes it to the classifier for training and prediction.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegression())
])
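For context, here is a minimal sketch of how such a pipeline is typically fitted and used; the tweets and labels below are hypothetical, purely to illustrate the data flow from raw text to prediction.

# Hypothetical toy data, just to show the flow through the pipeline
tweets = ["I love this phone", "worst service ever",
          "great battery life", "totally broken"]
labels = [1, 0, 1, 0]

# fit() runs CountVectorizer -> TfidfTransformer -> LogisticRegression in order
pipeline.fit(tweets, labels)

# predict() pushes new text through the same fitted steps
print(pipeline.predict(["battery is great"]))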

As no explanation was provided, I can't follow the thought process behind this. Isn't it just a regular Bag of Words? Can't the same thing be done with a single function, for example just a tf-idf vectorizer?

Any clarification would be greatly appreciated.


Solution

  • Bag of Words is exactly what CountVectorizer does: it builds a vector of word counts for each document.

    TfidfTransformer takes that BoW matrix and transforms it into tf-idf, i.e. term frequency within each document weighted by inverse document frequency across the corpus.

    These two steps of the pipeline can be replaced by TfidfVectorizer, which is literally BoW + TfIdf in one estimator. The tf-idf weighting is rarely applied to anything other than a BoW matrix, so the combined version makes sense if a trained classifier is all you need at the end of the day; a short sketch of the equivalence follows below.
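As a quick illustration of that equivalence (a minimal sketch on a toy corpus, assuming default parameters for all three estimators), the two-step CountVectorizer + TfidfTransformer route and the one-step TfidfVectorizer route produce the same matrix:

from sklearn.feature_extraction.text import (CountVectorizer,
                                              TfidfTransformer,
                                              TfidfVectorizer)
import numpy as np

docs = ["the cat sat", "the dog barked", "the cat and the dog"]  # toy corpus

# Two steps: raw counts (BoW), then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer builds the counts and applies tf-idf internally
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True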