I have a dataframe df:
AID VID FID APerc VPerc
1 A X 0.2 0.5
1 A Z 0.1 0.3
1 A Y 0.4 0.9
2 A X 0.2 0.3
2 A Z 0.9 0.1
1 B Z 0.1 0.2
1 B Y 0.8 0.3
1 B W 0.5 0.4
1 B X 0.6 0.3
I want to calculate the cosine similarity of the values APerc and VPerc for all pairs of AID and VID. So the result for the above should be:
AID VID CosSim
1 A 0.997
2 A 0.514
1 B 0.925
I know how to groupby: df.groupby(['AID','VID'])
and I know how to generate cosine similarity for the whole column:
from sklearn.metrics.pairwise import cosine_similarity
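# recent scikit-learn versions require 2-D inputs, so each column is passed as one row
cosine_similarity(df['APerc'].to_numpy().reshape(1, -1), df['VPerc'].to_numpy().reshape(1, -1))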
What's the best and fastest way to do this, given that I have a really large file?
I'm not sure it is the fastest, but groupby.apply is usually the way to do this:
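# cosine_similarity expects 2-D inputs, so each column is reshaped to a single row
(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'].to_numpy().reshape(1, -1),
                                      g['VPerc'].to_numpy().reshape(1, -1))[0][0]))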
#AID VID
#1 A 0.997097
# B 0.924917
#2 A 0.514496
#dtype: float64
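If groupby.apply turns out to be too slow on a really large file, one alternative worth trying is to skip the per-group Python call entirely. Cosine similarity is just sum(a*b) / sqrt(sum(a*a) * sum(b*b)), so the three sums can be computed with a single vectorized groupby. The sketch below is an assumption on my part rather than something benchmarked on your data; the helper name grouped_cosine and the temporary columns _dot, _aa and _bb are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "AID":   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    "VID":   list("AAAAABBBB"),
    "FID":   list("XZYXZZYWX"),
    "APerc": [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    "VPerc": [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

def grouped_cosine(df, keys=("AID", "VID"), a="APerc", b="VPerc"):
    """Cosine similarity of columns a and b within each group of keys.

    Hypothetical helper: builds the per-row products once, then lets
    groupby().sum() do the aggregation without a Python-level lambda.
    """
    sums = (
        df.assign(_dot=df[a] * df[b],   # per-row term of the dot product
                  _aa=df[a] ** 2,       # per-row term of ||a||^2
                  _bb=df[b] ** 2)       # per-row term of ||b||^2
          .groupby(list(keys))[["_dot", "_aa", "_bb"]]
          .sum()
    )
    cos = sums["_dot"] / np.sqrt(sums["_aa"] * sums["_bb"])
    return cos.rename("CosSim").reset_index()

result = grouped_cosine(df)
print(result)
```

On the sample data this gives the same three values as the groupby.apply version. Note that a group whose APerc or VPerc values are all zero produces NaN here (0/0), so handle that case separately if it can occur in your file.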