sentiment-analysis tf-idf text-classification naivebayes

In general, when does TF-IDF reduce accuracy?

I'm training a Naive Bayes model to classify a corpus of 200,000 reviews as positive or negative, and I noticed that applying TF-IDF actually reduced accuracy by about 2% (measured on a test set of 50,000 reviews). Does TF-IDF make any underlying assumptions about the data or the model it is used with? In other words, are there cases where using it reduces accuracy?

Solution

The IDF component of TF-IDF can harm your classification accuracy in some cases.

Consider the following artificial, easy classification task, made up for the sake of illustration:

Class A: texts containing the word 'corn'
Class B: texts not containing the word 'corn'

Suppose now that Class A has 100,000 examples and Class B has 1,000.

What happens with TF-IDF? The inverse document frequency of 'corn' will be very low, because the word is found in almost all documents (100,000 out of 101,000). The feature 'corn' therefore gets a very small TF-IDF value, and that value is the weight of the feature used by the classifier. Yet 'corn' is obviously THE best feature for this classification task. This is an example where TF-IDF may reduce your classification accuracy. In more general terms:

when there is class imbalance: if one class has many more instances than the other, the words that characterize the frequent class tend to have a lower IDF, so that class's best features get a lower weight
when words that are very predictive of one of the classes are also very frequent (for example, words found in most documents of that class)
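
To put rough numbers on the 'corn' example, here is a minimal sketch. It assumes the plain textbook definition idf = log(N / df); libraries often use smoothed variants, so the absolute values you see in practice will differ. The helper name `idf` and the "rare word" that occurs in only 10 documents are made up for illustration.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Plain (unsmoothed) inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)


n_class_a = 100_000  # texts containing 'corn'
n_class_b = 1_000    # texts not containing 'corn'
n_docs = n_class_a + n_class_b

# 'corn' occurs in every Class A text and in no Class B text.
idf_corn = idf(n_docs, doc_freq=n_class_a)

# Hypothetical rare word found in only 10 texts, for comparison.
idf_rare = idf(n_docs, doc_freq=10)

print(f"idf('corn')    = {idf_corn:.4f}")
print(f"idf(rare word) = {idf_rare:.4f}")
print(f"rare / corn    = {idf_rare / idf_corn:.0f}x")
```

Under this definition 'corn' gets an IDF of roughly 0.01, while the rare word gets about 9.2. The single feature that perfectly separates the two classes ends up weighted several hundred times lower than an arbitrary rare word.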