information-retrieval, vsm, cosine-similarity, tf-idf

Cosine similarity and tf-idf


I am confused by the following comment about TF-IDF and Cosine Similarity.

I was reading up on both, and then on Wikipedia, under Cosine Similarity, I found this sentence: "In case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°."

Now I'm wondering: aren't they two different things?

Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the dot products and Euclidean lengths.

I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?


Solution

  • Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.

    If d2 and q are tf-idf vectors, then

    $$\cos\theta = \frac{\mathbf{d_2} \cdot \mathbf{q}}{\lVert \mathbf{d_2} \rVert \, \lVert \mathbf{q} \rVert}$$

    where θ is the angle between the vectors. Every component of a tf-idf vector is non-negative, so the dot product is non-negative and cos θ ≥ 0. θ can therefore only range from 0 to 90 degrees, and as θ goes from 0 to 90 degrees, cos θ goes from 1 to 0.

    There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.

    (Formula taken from Wikipedia, hence the d2.)
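
To make the two-step picture concrete, here is a minimal sketch in plain Python: tf-idf is applied first to turn texts into vectors, and cosine similarity is then computed on those vectors. The function names `tfidf_vectors` and `cosine_similarity`, the toy documents, and the weighting (raw counts for tf, `log(N / df) + 1` for idf) are all assumptions chosen for illustration; many other tf-idf variants exist, and libraries may use different ones.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into tf-idf vectors over a shared vocabulary.

    Assumption: raw term counts for tf and idf = log(N / df) + 1, which is
    only one of several common weighting variants.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] * idf[term] for term in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock prices fell sharply".split(),
]

# Step 1: tf-idf turns texts into non-negative real-valued vectors.
vocab, (d1, d2, d3) = tfidf_vectors(docs)

# Step 2: cosine similarity works on any pair of those vectors.
sim_12 = cosine_similarity(d1, d2)  # documents share "the", "sat", "on"
sim_13 = cosine_similarity(d1, d3)  # documents share no terms
print(sim_12, sim_13)
```

Because every weight is non-negative, the similarity of the two overlapping documents falls strictly between 0 and 1, and the pair with no shared terms gets exactly 0. Swapping `tfidf_vectors` for a different weighting would leave `cosine_similarity` untouched, which is the sense in which the two are separate steps.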