data-mining text-mining tf-idf cosine-similarity linguistics

Why are Cosine Similarity and TF-IDF used together?


TF-IDF and Cosine Similarity are a commonly used combination for text clustering. Each document is represented by a vector of TF-IDF weights.

This is what my textbook says.

With Cosine Similarity you can then compute the similarities between those documents.

But why exactly are those techniques used together?
What is the advantage?

Could Jaccard Similarity, for example, also be used?

I know how it works, but I want to know why exactly these techniques are used.


Solution

  • TF-IDF is the weighting used.

    Cosine is the measure used.

    You could use cosine without weighting, but the results are then usually worse. Jaccard works on sets; it is not obvious how to incorporate weights without either turning it into a different measure or making it equivalent to cosine.
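To make the split between weighting and measure concrete, here is a minimal sketch in plain Python. The three toy documents, the helper names (`tf_idf_vectors`, `cosine`, `jaccard`), and the exact TF-IDF variant (raw term count times log(N / df)) are assumptions chosen for illustration; real libraries typically use smoothed or normalized variants.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term count times log(N / df).
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def jaccard(a, b):
    """Jaccard similarity on term sets; counts and weights are ignored."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
counts = [dict(Counter(d)) for d in docs]   # unweighted term counts
weighted = tf_idf_vectors(docs)             # TF-IDF weights

print("cosine, raw counts:", cosine(counts[0], counts[1]))
print("cosine, TF-IDF:    ", cosine(weighted[0], weighted[1]))
print("Jaccard, term sets:", jaccard(docs[0], docs[1]))
```

In this toy corpus "the" occurs in every document, so its TF-IDF weight is zero and it no longer inflates the cosine score the way it does with raw counts. Jaccard only sees which terms are present, so it has no place to put such a weight.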