Tags: classification, document-classification, tf-idf

TFIDF: tf implementation


I am implementing a classification tool and was experimenting with various TF versions: two logarithmic (with the correction applied inside or outside the logarithm), normalized, augmented, and log-average. There is a significant difference in my classifier's accuracy depending on which one I use - as much as 5%. What is odd, however, is that I cannot tell in advance which one will perform better on a given dataset. I wonder if there is some work that I am missing, or perhaps someone could share their experience working with these?
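
Roughly, the formulations I have in mind look like the sketch below (exact definitions vary between sources, so treat the zero-handling and parameters as one possible choice):

```python
import math

def tf_log_inside(tf):
    """Logarithmic TF with the +1 correction inside the log: log(1 + tf)."""
    return math.log(1.0 + tf)

def tf_log_outside(tf):
    """Logarithmic TF with the +1 correction outside the log: 1 + log(tf), 0 if absent."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def tf_normalized(tf, doc_length):
    """Raw count normalized by the document length."""
    return tf / doc_length if doc_length else 0.0

def tf_augmented(tf, max_tf, k=0.5):
    """Augmented TF: raw count scaled by the count of the document's most frequent term."""
    return k + (1.0 - k) * tf / max_tf if max_tf else 0.0

def tf_log_average(tf, avg_tf):
    """Log-average TF: (1 + log(tf)) / (1 + log(average tf in the document))."""
    return (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf)) if tf > 0 else 0.0
```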


Solution

  • Basically, the increase in importance from each additional occurrence of a term in a document should decrease with the number of appearances of that term. For instance, "car" appearing twice in a document implies that the term is much more important than if it appeared only once. However, if you compare a term appearing 20 times with the same term appearing 19 times, this difference should be smaller.

    What you are doing by choosing different normalisations is defining how quickly the TF value saturates (see the sketch after this list).

    You can try to correlate your findings with corpus statistics such as the average TF per document.
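
To make the saturation point concrete, here is a small illustrative sketch (the formulas and the max_tf value are just one possible choice) comparing how fast a logarithmic TF grows at low versus high counts:

```python
import math

# Illustrative comparison of how the weight increment shrinks as the raw
# count grows; max_tf=20 is an arbitrary value chosen for the example.
def log_tf(tf):
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def augmented_tf(tf, max_tf=20):
    return 0.5 + 0.5 * tf / max_tf

for tf in (1, 2, 19, 20):
    print(f"tf={tf:2d}  log: {log_tf(tf):.3f}  augmented: {augmented_tf(tf):.3f}")

# For the logarithmic variant the jump from tf=1 to tf=2 (~0.69) is much
# larger than from tf=19 to tf=20 (~0.05); the augmented variant grows
# linearly, so both jumps are the same (0.025).
```

Which saturation speed helps will depend on how term repetition behaves in your corpus, which is why correlating accuracy with per-document TF statistics can be informative.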