
I have two formulas for calculating 'cosine similarity'; what's the difference?


I am doing a project about cosine similarity on a movie dataset, and I'm confused about the formula for calculating it.

[Image: the cosine similarity formula, cos(theta) = A*B / (norm(A)*norm(B)) = sum(Ai*Bi) / (sqrt(sum(Ai^2)) * sqrt(sum(Bi^2)))]

But I searched online, and some articles show the denominator as something like: sqrt(A1^2+B1^2) * sqrt(A2^2+B2^2) * ... * sqrt(Ai^2+Bi^2)

I'm confused: what's the difference? Which one is correct, or are they both correct?
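
For concreteness, here is a minimal sketch (with made-up 2-D example vectors) computing both versions:

import math

A = [4, 3]
B = [5, 5]

dot = sum(a * b for a, b in zip(A, B))

# Version 1: denominator is the product of the two vector norms
v1 = dot / (math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B)))

# Version 2: denominator is a product of per-component terms, as in those articles
v2 = dot / math.prod(math.sqrt(a * a + b * b) for a, b in zip(A, B))

print(v1)  # ~0.990
print(v2)  # ~0.937, a different number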


Solution

  • The one in your image is correct. In two dimensions, it is derived from the Law of Cosines, which relates the length of one side of a triangle, c, to the lengths of the other two sides, a and b, and the angle theta opposite c:

    c^2 == a^2 + b^2 - 2*a*b*cos(theta)

    You can prove this in many ways, and a good check is that when cos(theta) == 0 (sides a and b are orthogonal), you recover the Pythagorean Theorem. To get the formula in the image, you translate it into analytic geometry (vectors):

    norm(A-B)^2 == norm(A)^2 + norm(B)^2 - 2*norm(A)*norm(B)*cos(theta)

    and by using the fact that norm(A-B)^2 is by definition (A-B)*(A-B), expanding gives

    norm(A-B)^2 == norm(A)^2 + norm(B)^2 - 2*A*B

    Equating both expressions and cancelling terms yields

    norm(A)*norm(B)*cos(theta) = A*B

    which is the (rearranged) formula in your definition (where norm(v) = sqrt(v*v)). For n dimensions this still works, because rotating Euclidean space preserves norms and inner products, and the 2D plane spanned by the two vectors is just a rotation of the xy plane.
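
    As a quick numeric check of that identity, here is a minimal sketch using numpy and the 3-D example vectors from the blog examples below:

    import numpy as np

    A = np.array([4.0, 3.0, 5.0])
    B = np.array([5.0, 5.0, 1.0])

    # Law of Cosines, rearranged: cos(theta) = (|A|^2 + |B|^2 - |A-B|^2) / (2*|A|*|B|)
    lhs = (np.dot(A, A) + np.dot(B, B) - np.dot(A - B, A - B)) / (
        2 * np.linalg.norm(A) * np.linalg.norm(B))

    # Dot-product form: cos(theta) = A*B / (norm(A)*norm(B))
    rhs = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

    print(lhs, rhs)  # both ~0.792, so the two expressions agree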

    A good sanity check is, again, that orthogonality yields a cosine of 0, and that the cosine always lies between -1 and 1 (this is the Cauchy-Schwarz inequality); for vectors with non-negative entries, such as ratings, it lies between 0 and 1.
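
    For instance (a minimal check with made-up vectors):

    import numpy as np

    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))    # 0.0, orthogonal vectors
    print(cos_sim(np.array([1.0, 1.0]), np.array([-1.0, -1.0])))  # -1.0, opposite directions
    print(cos_sim(np.array([2.0, 2.0]), np.array([5.0, 5.0])))    # 1.0, same direction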

    Update: For the examples mentioned in your comment, you can reproduce the results from the blog by running:

    import sklearn.metrics.pairwise as pw
    print(pw.cosine_similarity([[4, 3]], [[5, 5]]))        # ~0.990
    print(pw.cosine_similarity([[4, 3, 5]], [[5, 5, 1]]))  # ~0.792
    

    Note that if you run:

    from sklearn.metrics.pairwise import pairwise_distances
    print(pairwise_distances([[4, 3, 5]], [[5, 5, 1]], metric='cosine'))  # ~0.208
    

    You get 0.208 instead of 0.792. This is because pairwise_distances with metric='cosine' returns the cosine distance, defined as 1 - cos(theta) (note that 0.208 + 0.792 == 1). This transformation is used because, for a distance, you want the distance from a point to itself to be 0.
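
    As a minimal sketch of that relationship (using the same example vectors):

    import numpy as np
    import sklearn.metrics.pairwise as pw

    sim = pw.cosine_similarity([[4, 3, 5]], [[5, 5, 1]])
    dist = pw.pairwise_distances([[4, 3, 5]], [[5, 5, 1]], metric='cosine')

    # cosine distance is 1 - cosine similarity, so the two always sum to 1
    print(np.allclose(sim + dist, 1.0))  # True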