python, data-science, cosine-similarity

cosine_similarity giving different answer for dataframe and subset of dataframe


I have the following code for my recommendation system, and the two scenarios below give different output for the same pair of rows.

Scenario 1:

a = df[df.index == 5031]
b = df[df.index == 9365]

print(cosine_similarity(a,b)) #0.33

Scenario 2:

cosine_sim = cosine_similarity(df)

print(cosine_sim[5031][9365]) #0.25

I think the output for both scenarios should be the same. Scenario 1 looks more accurate to me given the data. Can anyone help with this?

Dataframe looks like this.


Solution

  • You are mixing label-based indexing with position-based indexing.

    In scenario 1 you select the vectors by their index labels:

    # labels 5031 and 9365
    a = df[df.index == 5031]
    b = df[df.index == 9365]
    

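    The matrix returned by sklearn.metrics.pairwise.cosine_similarity is a plain NumPy array and knows nothing about the index labels, so cosine_sim[5031][9365] in scenario 2 reads row position 5031 and column position 9365, not the rows labelled 5031 and 9365. Before you read a value from the matrix, you need to look up the position of each label in the dataframe: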

    # translate each label into its row position (assumes the labels are unique)
    idx_a = df.index.get_loc(5031)
    idx_b = df.index.get_loc(9365)
    print(cosine_sim[idx_a, idx_b])
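
To make the difference concrete, here is a minimal sketch on made-up data. The four-row dataframe, its shuffled index labels, and the column names f1 and f2 are assumptions chosen for illustration, not the asker's data. It compares the label-based lookup, the positional lookup via get_loc, and the naive lookup that uses labels as positions. It also shows one possible alternative: wrapping the similarity matrix in a DataFrame that reuses df.index, so .loc can be used with labels directly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: the index labels are deliberately not 0..n-1 in order
df = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    index=[3, 0, 2, 1],
    columns=["f1", "f2"],
)

# Full pairwise matrix: a NumPy array addressed by row position only
cosine_sim = cosine_similarity(df)

# Scenario 1: select the two rows by label, then compare them
a = df[df.index == 3]
b = df[df.index == 1]
by_label = cosine_similarity(a, b)[0, 0]

# Correct lookup in the full matrix: convert labels to positions first
idx_a = df.index.get_loc(3)
idx_b = df.index.get_loc(1)
by_position = cosine_sim[idx_a, idx_b]

# Scenario 2 as written in the question: labels used as positions
naive = cosine_sim[3, 1]

# Alternative: attach the labels to the matrix so .loc works with them
sim_df = pd.DataFrame(cosine_sim, index=df.index, columns=df.index)
by_loc = sim_df.loc[3, 1]

print(by_label, by_position, by_loc, naive)
```

In this toy example the first three values agree, while the naive lookup compares a different pair of rows and returns a different number. The same mismatch happens whenever the dataframe index is not the default 0..n-1 range, for example after filtering or sorting.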