Tags: python, pandas, dataframe, cosine-similarity

Calculate cosine similarity between two columns within each groupby group of a DataFrame


I have a dataframe df:

AID   VID   FID   APerc   VPerc
1     A     X     0.2     0.5
1     A     Z     0.1     0.3
1     A     Y     0.4     0.9
2     A     X     0.2     0.3
2     A     Z     0.9     0.1
1     B     Z     0.1     0.2
1     B     Y     0.8     0.3
1     B     W     0.5     0.4
1     B     X     0.6     0.3

I want to calculate the cosine similarity between the APerc and VPerc columns within each (AID, VID) group. So the result for the above should be:

AID   VID   CosSim   
1     A     0.997   
2     A     0.514    
1     B     0.925     

I know how to groupby: df.groupby(['AID','VID'])

and I know how to compute the cosine similarity over the whole columns (cosine_similarity expects 2-D input, so each column is passed as a single row):

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([df['APerc']], [df['VPerc']])

What's the best and fastest way to do this, given that I have a really large file?


Solution

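  • groupby.apply is the usual way to do this, though it may not be the fastest. Recent scikit-learn versions require 2-D input to cosine_similarity, so each column is reshaped to a single row first:

    from sklearn.metrics.pairwise import cosine_similarity

    # one 1 x n row vector per column, per (AID, VID) group
    (df.groupby(['AID','VID'])
       .apply(lambda g: cosine_similarity(g['APerc'].to_numpy().reshape(1, -1),
                                          g['VPerc'].to_numpy().reshape(1, -1))[0][0]))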
    
    #AID  VID
    #1    A      0.997097
    #     B      0.924917
    #2    A      0.514496
    #dtype: float64
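
Since the question mentions a really large file, here is a sketch of a possible alternative that avoids calling a Python function once per group. It relies on the definition cos(a, b) = sum(a*b) / sqrt(sum(a^2) * sum(b^2)), so the three sums can be computed with a single vectorized groupby. The helper name grouped_cosine and the temporary column names _dot, _aa, _bb are made up for this illustration; whether it is actually faster on your data is an assumption you should check by timing it.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "AID": [1, 1, 1, 2, 2, 1, 1, 1, 1],
    "VID": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "FID": ["X", "Z", "Y", "X", "Z", "Z", "Y", "W", "X"],
    "APerc": [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    "VPerc": [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})


def grouped_cosine(frame, keys, a, b):
    """Cosine similarity between columns a and b within each group of keys.

    Hypothetical helper, not part of pandas or scikit-learn.
    """
    # Row-wise terms of the dot product and of the two squared norms.
    tmp = frame.assign(
        _dot=frame[a] * frame[b],
        _aa=frame[a] ** 2,
        _bb=frame[b] ** 2,
    )
    # One vectorized groupby-sum instead of one Python call per group.
    sums = tmp.groupby(keys)[["_dot", "_aa", "_bb"]].sum()
    # A group whose a or b values are all zero yields NaN here.
    return (sums["_dot"] / np.sqrt(sums["_aa"] * sums["_bb"])).rename("CosSim")


result = grouped_cosine(df, ["AID", "VID"], "APerc", "VPerc").reset_index()
print(result)
```

On the sample data this gives the same three values as the groupby.apply version above, returned as a DataFrame with columns AID, VID and CosSim.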