I have the following code for my recommendation system, and the two scenarios below give different results for the same pair of rows.
Scenario 1:
a = df[df.index == 5031]
b = df[df.index == 9365]
print(cosine_similarity(a,b)) #0.33
Scenario 2:
cosine_sim = cosine_similarity(df)
print(cosine_sim[5031][9365]) #0.25
I expected both scenarios to give the same output. Scenario 1 looks more accurate given the data. Can anyone explain the difference?
Dataframe looks like this.
You are mixing label-based indexing with position-based indexing.
In scenario 1 you select the vectors by index label:
# labels 5031 and 9365
a = df[df.index == 5031]
b = df[df.index == 9365]
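The matrix returned by sklearn.metrics.pairwise.cosine_similarity is a plain NumPy array
and does not know anything about the index labels, so cosine_sim[5031][9365] reads row position 5031 and column position 9365, not the rows labelled 5031 and 9365.
Before you read a value from the matrix, you therefore need to look up the position of each label in the dataframe: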
idx_a = df.index.get_loc(5031)
idx_b = df.index.get_loc(9365)
print(cosine_sim[idx_a, idx_b])  # same pair of rows as in scenario 1
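To check this on something small, here is a minimal sketch using a made-up three-row dataframe (the values and the extra label 42 are assumptions for illustration, not your data). The index labels are deliberately out of step with the row positions, and the sketch assumes the index is unique, so that get_loc returns a single integer. It also shows one possible alternative: wrapping the matrix in a DataFrame that reuses the original index, so that it can be queried by label directly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: the index labels do not match the row positions.
df = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    index=[9365, 42, 5031],
)

# Full pairwise matrix: a plain NumPy array of shape (3, 3).
cosine_sim = cosine_similarity(df)

# Scenario 1: select the two rows by label and compare them directly.
a = df.loc[[5031]]
b = df.loc[[9365]]
by_label = cosine_similarity(a, b)[0, 0]

# Scenario 2, corrected: translate each label to its row position first.
idx_a = df.index.get_loc(5031)  # position 2
idx_b = df.index.get_loc(9365)  # position 0
by_position = cosine_sim[idx_a, idx_b]

# Alternative: keep the labels by wrapping the matrix in a DataFrame.
sim_df = pd.DataFrame(cosine_sim, index=df.index, columns=df.index)
by_wrapped = sim_df.loc[5031, 9365]

print(by_label, by_position, by_wrapped)
```

All three values agree here. The labelled wrapper is convenient if you look up many pairs, at the cost of storing the index twice.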