Tags: python, numpy, scikit-learn, similarity, cosine-similarity

Cosine Similarity


I was reading and came across this formula:

[image: formula for the adjusted cosine similarity]

The formula is for cosine similarity. I thought this looked interesting, so I created a NumPy array with one row per user_id and one column per item_id. For instance, let M be this matrix:

M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]] 

Here the entry in row u and column i is the rating that user u has given to item i. I want to calculate this cosine similarity between the items (columns) of this matrix. This should yield a 5 x 5 matrix, I believe. I tried:

import pandas as pd

df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values

However, this does not seem right.


Solution

  • Here's a possible implementation of the adjusted cosine similarity:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    
    M = np.asarray([[2, 3, 4, 1, 0], 
                    [0, 0, 0, 0, 5], 
                    [5, 4, 3, 0, 0], 
                    [1, 1, 1, 1, 1]])
    
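    # Mean rating of each user (one value per row).
    M_u = M.mean(axis=1)
    # Subtract each user's mean from that user's ratings.
    item_mean_subtracted = M - M_u[:, None]
    # pdist compares rows, so transpose to compare items (columns);
    # 1 - cosine distance gives the cosine similarity.
    similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))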
    

    Remarks:

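    • I'm taking advantage of NumPy broadcasting to subtract the mean: M_u[:, None] has shape (4, 1), so each user's mean is subtracted from every entry in that user's row.
    • If M is a sparse matrix, you could convert it to a dense array first with something like this: M.toarray().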
    • From the docs:

      Y = pdist(X, 'cosine')
      Computes the cosine distance between vectors u and v,
      1 − u⋅v / (||u||_2 ||v||_2)
      where ||*||_2 is the 2-norm of its argument *, and u⋅v is the dot product of u and v.

    • Array transposition is performed through the T attribute.
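To make the formula quoted from the docs more concrete, here is a minimal NumPy-only sketch that computes the same similarities without pdist. The names centered, norms and similarity_np are illustrative and not part of the solution above, and the sketch assumes that no item column is all zeros after the user means are subtracted (such a column would cause a division by zero).

```python
import numpy as np

M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]], dtype=float)

# Subtract each user's (row's) mean rating, as in the solution above.
centered = M - M.mean(axis=1, keepdims=True)

# 2-norm of every item (column) vector.
norms = np.linalg.norm(centered, axis=0)

# Dot product u.v for every pair of columns, divided by ||u||_2 * ||v||_2.
# Assumption: every entry of norms is non-zero.
similarity_np = (centered.T @ centered) / np.outer(norms, norms)
```

Because this is the similarity itself rather than the distance, there is no "1 -" step here; the result should match similarity_matrix up to floating-point rounding.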

    Demo:

    In [277]: M_u
    Out[277]: array([ 2. ,  1. ,  2.4,  1. ])
    
    In [278]: item_mean_subtracted
    Out[278]: 
    array([[ 0. ,  1. ,  2. , -1. , -2. ],
           [-1. , -1. , -1. , -1. ,  4. ],
           [ 2.6,  1.6,  0.6, -2.4, -2.4],
           [ 0. ,  0. ,  0. ,  0. ,  0. ]])
    
    In [279]: np.set_printoptions(precision=2)
    
    In [280]: similarity_matrix
    Out[280]: 
    array([[ 1.  ,  0.87,  0.4 , -0.68, -0.72],
           [ 0.87,  1.  ,  0.8 , -0.65, -0.91],
           [ 0.4 ,  0.8 ,  1.  , -0.38, -0.8 ],
           [-0.68, -0.65, -0.38,  1.  ,  0.27],
           [-0.72, -0.91, -0.8 ,  0.27,  1.  ]])
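For comparison with the pandas attempt in the question, here is a hedged sketch of the same computation done with a DataFrame. The attempt in the question subtracts df.mean(axis=0), which is the mean of each item rather than of each user, and Pearson correlation then centers every column by its own mean again, so it does not compute the cosine of the user-centered columns. The names user_mean_subtracted, unit_columns and similarity_df are illustrative only, and the sketch assumes no column is all zeros after centering.

```python
import numpy as np
import pandas as pd

M = [[2, 3, 4, 1, 0],
     [0, 0, 0, 0, 5],
     [5, 4, 3, 0, 0],
     [1, 1, 1, 1, 1]]
df = pd.DataFrame(M, dtype=float)

# Subtract each user's mean (row mean) from that user's ratings.
user_mean_subtracted = df.sub(df.mean(axis=1), axis=0)

# Scale every item column to unit length (assumes no all-zero column).
column_norms = np.sqrt((user_mean_subtracted ** 2).sum(axis=0))
unit_columns = user_mean_subtracted / column_norms

# Pairwise dot products of unit-length columns are the cosine similarities.
similarity_df = unit_columns.T.dot(unit_columns)
```

similarity_df is a 5 x 5 DataFrame indexed by item on both axes; similarity_df.values gives a plain array that can be compared with similarity_matrix above.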