recommendation-engine · cosine-similarity · collaborative-filtering

Choice between an adjusted cosine similarity vs regular cosine similarity


I'm working on an item-based CF system that uses adjusted cosine similarity. I recently added regular cosine similarity and got totally different results. Now my question is: which fits better, considering my data?

Here is a possible scenario of users, items and ratings:

         User 1 | User 2 | User 3 | User 4 | User 5
Item 1 |   5    |    1   |   1    |   5    |   5
Item 2 |   5    |    1   |   2    |   4    |   5
Item 3 |   1    |    5   |   4    |   2    |   3

Considering this data, you'd conclude that item 1 and item 2 are relatively 'similar'. Here are the results of the different similarity coefficients:

Similarity between Item 1 and Item 2
Adjusted cosine similarity = 0.865
Regular cosine similarity = 0.987
I rounded them off for this example

You can see these are basically the same, but when you calculate the similarity between Item 2 and Item 3 (which aren't similar at all), the two measures give totally different results:

Similarity between Item 2 and Item 3
Adjusted cosine similarity = -0.955
Regular cosine similarity = 0.656
I rounded them off for this example

Which of these would be 'better'? I assume the adjusted cosine similarity works better, since it takes the average rating of each user into account, but why would a regular cosine similarity give a positive number for such 'different' items? Should I just refrain from using the regular cosine similarity in general, or only in certain scenarios?
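For reference, here is a small NumPy script that reproduces the numbers above (up to rounding); the adjusted version subtracts each user's mean rating before comparing items:

```python
import numpy as np

# Ratings matrix from the table above: rows = items, columns = users
R = np.array([[5, 1, 1, 5, 5],
              [5, 1, 2, 4, 5],
              [1, 5, 4, 2, 3]], dtype=float)

def cosine(u, v):
    # plain cosine similarity between two vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Adjusted cosine: subtract each user's (column's) mean rating first
R_adj = R - R.mean(axis=0)

print(cosine(R[0], R[1]), cosine(R_adj[0], R_adj[1]))  # regular ≈ 0.987, adjusted ≈ 0.866
print(cosine(R[1], R[2]), cosine(R_adj[1], R_adj[2]))  # regular ≈ 0.656, adjusted ≈ -0.955
```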

Any help would be appreciated!


Solution

  • Why would a regular cosine similarity give a positive number for such 'different' items?

    As your example already shows, Adjusted Cosine Similarity reflects the differences better than Regular Cosine Similarity in certain circumstances.

    Regular Cosine Similarity by definition reflects differences in direction, but not in location:

    cos(A, B) = (‖A‖² + ‖B‖² − dist(A, B)²) / (2 · ‖A‖ · ‖B‖)

    dist(A,B) is the Euclidean Distance between A and B. It's clear that the cosine similarity will remain the same if any vector extends in its own direction.
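A quick sanity check of that invariance (using SciPy's cosine *distance*, where 1 − distance is the similarity):

```python
from scipy import spatial
import numpy as np

v = np.array([2.0, 1.0])
# Stretching v in its own direction leaves the cosine similarity unchanged
sim = 1 - spatial.distance.cosine(v, 3 * v)
print(sim)  # ≈ 1.0
```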

    Let's assume users give scores from 0 to 5 to two movies.

    from scipy import spatial
    import numpy as np
    a = np.array([2.0,1.0])  
    b = np.array([5.0,3.0])
    1 - spatial.distance.cosine(a,b)
    #----------------------
    # 0.99705448550158149
    #----------------------
    c = np.array([5.0,4.0])
    1 - spatial.distance.cosine(c,b)
    #----------------------
    # 0.99099243041032326
    #----------------------
    

    [Figure: a, b and c drawn as vectors from the origin — a points in nearly the same direction as b and c, despite lying far from both]

    Intuitively we would say users b and c have similar tastes, and a is quite different from them. But the regular cosine similarity tells a different story.

    Let's calculate the Adjusted Cosine Similarity: first subtract the mean of all the ratings in the two vectors.

    mean_ab = (a.sum() + b.sum()) / 4
    # NOTE: not sum(sum(a, b)) / 4 — builtin sum() would treat b as the
    # start value and count it twice, giving the wrong mean
    # mean_ab : 2.75
    # adjusted vectors : [-0.75, -1.75] , [2.25, 0.25]
    1 - spatial.distance.cosine(a - mean_ab, b - mean_ab)
    #----------------------
    # ≈ -0.493
    #----------------------
    mean_cb = (c.sum() + b.sum()) / 4
    # mean_cb : 4.25
    # adjusted vectors : [0.75, -0.25] , [0.75, -1.25]
    1 - spatial.distance.cosine(c - mean_cb, b - mean_cb)
    #----------------------
    # ≈ 0.759
    #----------------------
    

    It's easy to see that the adjustment is meaningful: it separates the dissimilar pair (a, b) from the similar pair (c, b).

  • Should I just refrain from using the regular cosine similarity in general or only for certain scenarios?

    Use whichever suits your data: when per-user rating offsets distort the comparison, as in the example above, prefer the adjusted version.

    I still think the regular cosine similarity is useful in scenarios where we want less sensitivity to the scale of vectors. For example, if the scores [2,1] are considered very similar to [4,2] or [8,4], the regular version will do a fine job.
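For instance, assuming that scale-free notion of similarity is what you want:

```python
from scipy import spatial
import numpy as np

base = np.array([2.0, 1.0])
for v in (np.array([4.0, 2.0]), np.array([8.0, 4.0])):
    # All three vectors are parallel, so the similarity is ≈ 1.0 each time
    print(1 - spatial.distance.cosine(base, v))
```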