python pandas dataframe cosine-similarity

How to interpret cosine similarity output in python


I'm a Python beginner. I have a pandas DataFrame df with the columns userID, weight, SEI, and name.

#libraries
   import numpy as np
   import pandas as pd
   from sklearn.metrics.pairwise import cosine_similarity
    
#dataframe
   userID    weight     SEI        name
   3         125.0      0.562140   263
   4         254.0      0.377294   869
   5         451.0      0.872896   196
   1429      451.0      0.872896   196
   5         129.0      0.569432   582
   ...       ...        ...        ...

#output
   cosine_similarity(df)

   array([[1.        , 0.98731894, 0.75370844, ..., 0.33814175, 0.33700687, 0.24443919],
   [0.98731894, 1.        , 0.63987877, ..., 0.35037059, 0.34963404, 0.23870279],
   [0.75370844, 0.63987877, 1.        , ..., 0.16648431, 0.16403693, 0.17438159], 
   ...,

The person with userID 3 has a weight of 125.0 and an SEI of 0.562140. The person with name 263 also has a weight of 125.0 and an SEI of 0.562140. (I had to label-encode the name column because cosine_similarity would not run until that column had a numeric data type. Hopefully this doesn't affect the end goal?)

The goal is to match values in the userID column to values in the name column using cosine similarity across all rows. I just need some guidance on interpreting the output in order to do this. All I know is that the higher the cosine value, the greater the similarity.

Any help is appreciated!


Solution

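  • Make it easier for yourself and group by two columns, then compute one cosine similarity per group:

    # Assumes the question's column names; substitute 'userID_x' / 'artist'
    # if your merged frame uses those instead of 'userID' / 'name'.
    result1 = df.sort_values('weight')
    result2 = (result1.groupby(['userID', 'SEI'])
               .apply(lambda g: cosine_similarity(
                   g['weight'].values.reshape(1, -1),
                   g['name'].values.reshape(1, -1))[0][0])
               .rename('CosSim').reset_index())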
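
As for reading the matrix itself: cosine_similarity(X) returns an n-by-n array in which entry [i, j] is the similarity between row i and row j of the input. It is symmetric, and the diagonal is 1 for any non-zero row because each row is compared with itself. Below is a minimal sketch of turning that matrix into a best match per row. It assumes only weight and SEI are meaningful features (userID and the label-encoded name are identifiers, not measurements), and the names features, sim_full, sim, best, and matches are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data copied from the first four rows shown in the question.
df = pd.DataFrame({
    "userID": [3, 4, 5, 1429],
    "weight": [125.0, 254.0, 451.0, 451.0],
    "SEI": [0.562140, 0.377294, 0.872896, 0.872896],
    "name": [263, 869, 196, 196],
})

# Assumption: compare rows on the numeric measurements only.
features = df[["weight", "SEI"]].to_numpy()

# sim_full[i, j] is the cosine similarity between row i and row j.
sim_full = cosine_similarity(features)

# Mask the diagonal so a row cannot be matched with itself.
sim = sim_full.copy()
np.fill_diagonal(sim, -np.inf)

# For each row, the position of the most similar other row.
best = sim.argmax(axis=1)

# Pair each userID with the name found on its most similar row.
matches = pd.DataFrame({
    "userID": df["userID"].to_numpy(),
    "best_match_name": df["name"].to_numpy()[best],
    "similarity": sim.max(axis=1),
})
print(matches)
```

Keep in mind that with raw columns the large weight values dominate the angle between rows, so almost every pair scores close to 1. Scaling the features first (for example with sklearn's StandardScaler) may give scores that separate rows more clearly.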