Python beginner here. I have a pandas DataFrame df with the columns userID, weight, SEI, and name.
#libraries
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
#dataframe
userID  weight  SEI       name
3       125.0   0.562140  263
4       254.0   0.377294  869
5       451.0   0.872896  196
1429    451.0   0.872896  196
5       129.0   0.569432  582
...     ...     ...       ...
#output
cosine_similarity(df)
array([[1. , 0.98731894, 0.75370844, ..., 0.33814175, 0.33700687, 0.24443919],
[0.98731894, 1. , 0.63987877, ..., 0.35037059, 0.34963404, 0.23870279],
[0.75370844, 0.63987877, 1. , ..., 0.16648431, 0.16403693, 0.17438159],
...,
The person with userID 3 has a weight of 125.0 and an SEI of 0.562140. The person with name 263 also has a weight of 125.0 and an SEI of 0.562140. (I had to use a label encoder on the name column because I could not run the cosine similarity function without changing the column's data type. Hopefully this doesn't affect the end goal?)
The goal is to match values in the userID column to values in the name column using cosine similarity across all rows. I just need some guidance on interpreting the output in order to do this. All I know is that the higher the cosine value, the greater the similarity.
Any help is appreciated!
Make it easier for yourself and group by two columns:
# Column names follow the question's DataFrame (userID, weight, SEI, name).
result1 = df.sort_values('weight')
result2 = (result1.groupby(['userID', 'SEI'])
           .apply(lambda g: cosine_similarity(
               g['weight'].values.reshape(1, -1),
               g['name'].values.reshape(1, -1))[0][0])
           .rename('CosSim')
           .reset_index())
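On reading the matrix itself: cosine_similarity(X) returns an n-by-n array where entry [i, j] is the similarity between row i and row j of X, so the diagonal is 1 (each row compared with itself) and the matrix is symmetric. Below is a minimal sketch of turning that into userID-to-name matches. The toy values, the choice to use only weight and SEI as features, and the names features, masked, best, and matches are my assumptions, not something from the question. I left userID and the label-encoded name out of the features because they are identifiers, and passing them in as numbers changes the similarity values.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy frame shaped like the one in the question (values are illustrative only).
df = pd.DataFrame({
    "userID": [3, 4, 5, 7],
    "weight": [125.0, 254.0, 451.0, 129.0],
    "SEI": [0.562140, 0.377294, 0.872896, 0.569432],
    "name": [263, 869, 196, 582],
})

# Assumption: compare rows on the numeric features only, not on the ID columns.
features = df[["weight", "SEI"]].to_numpy()

# sim[i, j] is the cosine similarity between row i and row j.
sim = cosine_similarity(features)

# Hide the diagonal so a row cannot be matched with itself.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)

# For each row i, the position of the most similar other row.
best = masked.argmax(axis=1)

# Pair the userID of row i with the name of its most similar row.
matches = pd.DataFrame({
    "userID": df["userID"].to_numpy(),
    "matched_name": df["name"].to_numpy()[best],
    "similarity": masked[np.arange(len(df)), best],
})
print(matches)
```

One caveat: weight is on a much larger scale than SEI, so on raw values almost every pair will score close to 1. If that makes the matches hard to tell apart, standardizing the columns first (for example with sklearn's StandardScaler) may help.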