Tags: python, pandas, dataframe, cosine-similarity

Calculate cosine similarity for all columns in a groupby in a DataFrame


I have a dataframe df, where the APerc columns range from APerc0 to APerc60:

ID    FID   APerc0   ...   APerc60
0     X     0.2      ...   0.5
1     Z     0.1      ...   0.3
2     Y     0.4      ...   0.9
3     X     0.2      ...   0.3
4     Z     0.9      ...   0.1
5     Z     0.1      ...   0.2
6     Y     0.8      ...   0.3
7     W     0.5      ...   0.4
8     X     0.6      ...   0.3

I want to calculate the cosine similarity between rows, using the values in all APerc columns. The result for the above should be:

      ID       CosSim   
1     0,2,4     0.997   
2     1,8,7     0.514    
1     3,5,6     0.925  

I know how to generate cosine similarity for the whole df:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)

But I want to find the similarity between the IDs and group them together (or create a separate df). How can I do this quickly for a large dataset?


Solution

  • One possible solution is to select the rows you want to compare and compute the cosine similarity for each pair, as follows.

    Here, combinations is the list of row-index pairs you want to compare.

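    import torch
    from torch import nn

    cos = nn.CosineSimilarity(dim=0)
    # Select the APerc columns by name rather than by position
    cols = [c for c in df.columns if c.startswith("APerc")]

    for a, b in combinations:
        # Convert each row to a float tensor, which nn.CosineSimilarity expects
        row1 = torch.tensor(df.loc[a, cols].to_numpy(dtype=float))
        row2 = torch.tensor(df.loc[b, cols].to_numpy(dtype=float))
        sim = cos(row1, row2)
        print(sim.item())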
    

    You can then use the result however you like.
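
If the goal is to compare rows that share the same FID (an assumption, since the question does not say exactly how the rows should be grouped), a vectorized variant may be faster on a large dataset than looping over pairs. The following is only a sketch: the function name groupwise_cosine and the output columns ID_a and ID_b are made up for illustration, and the toy frame keeps just the two APerc columns visible in the question. Each row is scaled to unit length, so one matrix product per group gives every pairwise cosine similarity.

```python
import numpy as np
import pandas as pd


def groupwise_cosine(df, group_col="FID", id_col="ID", prefix="APerc"):
    """Return one row per pair of IDs in the same group, with their cosine similarity."""
    value_cols = [c for c in df.columns if c.startswith(prefix)]
    frames = []
    for key, grp in df.groupby(group_col):
        if len(grp) < 2:
            continue  # a single row has nothing to be compared with
        X = grp[value_cols].to_numpy(dtype=float)
        # Scale rows to unit length; the dot product of unit vectors is the cosine.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        unit = X / np.where(norms == 0, 1.0, norms)
        sim = unit @ unit.T
        # The upper triangle (k=1) keeps each unordered pair once and skips self-pairs.
        i, j = np.triu_indices(len(grp), k=1)
        ids = grp[id_col].to_numpy()
        frames.append(pd.DataFrame({
            group_col: key,
            "ID_a": ids[i],
            "ID_b": ids[j],
            "CosSim": sim[i, j],
        }))
    if not frames:
        return pd.DataFrame(columns=[group_col, "ID_a", "ID_b", "CosSim"])
    return pd.concat(frames, ignore_index=True)


# Toy data: only the two APerc columns shown in the question.
df = pd.DataFrame({
    "ID": range(9),
    "FID": ["X", "Z", "Y", "X", "Z", "Z", "Y", "W", "X"],
    "APerc0": [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    "APerc60": [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})
result = groupwise_cosine(df)
print(result)
```

The per-group matrix grows quadratically with group size, so very large groups may need to be processed in chunks. sklearn's cosine_similarity applied to grp[value_cols] should give the same matrix as the manual normalization used here.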