python pandas dataframe cosine-similarity

Calculate cosine similarity for two columns within a groupby in a DataFrame

I have a dataframe df:

AID   VID   FID   APerc   VPerc
1     A     X     0.2     0.5
1     A     Z     0.1     0.3
1     A     Y     0.4     0.9
2     A     X     0.2     0.3
2     A     Z     0.9     0.1
1     B     Z     0.1     0.2
1     B     Y     0.8     0.3
1     B     W     0.5     0.4
1     B     X     0.6     0.3

I want to calculate the cosine similarity between the APerc and VPerc columns within each (AID, VID) group. So the result for the above should be:

AID   VID   CosSim   
1     A     0.997   
2     A     0.514    
1     B     0.925
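
To make the target concrete, here is a small hand check of the first expected value using plain NumPy. It is only a sketch: the arrays `a` and `v` are copied from the three rows with AID=1 and VID=A, and the variable names are just for illustration.

```python
import numpy as np

a = np.array([0.2, 0.1, 0.4])  # APerc values for AID=1, VID=A
v = np.array([0.5, 0.3, 0.9])  # VPerc values for AID=1, VID=A

# Cosine similarity: dot product divided by the product of the norms.
cos_sim = a.dot(v) / (np.linalg.norm(a) * np.linalg.norm(v))
print(round(float(cos_sim), 3))  # 0.997
```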

I know how to groupby: df.groupby(['AID','VID'])

and I know how to generate cosine similarity for the whole column:

from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity expects 2D inputs, so wrap each column as a single row
cosine_similarity([df['APerc']], [df['VPerc']])

What's the best and fastest way to do this, given that I have a really large file?

Solution

I'm not sure it is the fastest, but groupby.apply is the usual way to do this:

# each column is wrapped in a list so cosine_similarity gets a 2D (1, n) input
(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity([g['APerc']], [g['VPerc']])[0][0]))

#AID  VID
#1    A      0.997097
#     B      0.924917
#2    A      0.514496
#dtype: float64
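
If groupby.apply turns out to be too slow on a large file, one possible alternative is to avoid the per-group Python call entirely: compute the row-wise terms of the cosine formula once, sum them per group, and combine the sums. The sketch below assumes the same column names as in the question; the helper columns `dot`, `a_sq` and `v_sq` are names I made up for illustration, and I have not benchmarked it against the apply version.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "AID":   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    "VID":   list("AAAAABBBB"),
    "FID":   list("XZYXZZYWX"),
    "APerc": [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    "VPerc": [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# Row-wise terms: a*v for the dot product, a^2 and v^2 for the norms.
sums = (
    df.assign(
        dot=df["APerc"] * df["VPerc"],
        a_sq=df["APerc"] ** 2,
        v_sq=df["VPerc"] ** 2,
    )
    .groupby(["AID", "VID"])[["dot", "a_sq", "v_sq"]]
    .sum()
)

# cos = sum(a*v) / sqrt(sum(a^2) * sum(v^2)), evaluated once per group.
cos_sim = (
    (sums["dot"] / np.sqrt(sums["a_sq"] * sums["v_sq"]))
    .rename("CosSim")
    .reset_index()
)
print(cos_sim)
```

This should give the same three values as the apply version, with the result as a regular DataFrame with AID, VID and CosSim columns.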