python pandas cosine-similarity

Get pairwise cosine similarity in pandas dataframe


I need to calculate the pairwise cosine similarity between the rows of a pandas DataFrame and store the result in another pandas DataFrame.

So far, I compute the similarity with cosine_similarity from sklearn.metrics.pairwise:

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(df, dense_output=False)

Sample from sim:

[[1.00000000 8.33333333 ... 8.72871561 8.72871561 8.72871561]
 [8.33333333 1.00000000 ... 7.63762616 7.63762616 7.63762616]]

Now I want to store it in a pandas DataFrame with this structure:

ID  Pair_ID  Sim_Value
1   1        1.00
1   2        8.33
.
.
.
.
2   1        8.33

How can I do that?


Solution

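  • Create the row and column index arrays for every pair, then build the DataFrame from them. Note that the mask below drops the diagonal, so self-pairs such as (1, 1) are not included: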

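    import numpy as np
    import pandas as pd

    # Row and column positions of every cell in sim, each flattened to 1-D
    i, j = np.indices(sim.shape).reshape(2, -1)

    # Drop the diagonal (each row paired with itself);
    # skip these three lines to keep the self-pairs
    mask = i != j
    i = i[mask]
    j = j[mask]

    # Map positions back to the index labels of df
    pd.DataFrame({
        'ID': df.index[i],
        'Pair_ID': df.index[j],
        'Sim_Value': sim[i, j]
    })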
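
For reference, here is a minimal end-to-end sketch of the approach above on toy data. The three-row df, the to_long helper, and its keep_self flag are hypothetical and only for illustration. The similarity matrix is computed with plain NumPy so that the sketch needs nothing beyond NumPy and pandas; for dense input without all-zero rows, cosine_similarity(df) should give the same matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the index labels play the role of the IDs.
df = pd.DataFrame(
    [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]],
    index=[1, 2, 3],
)

# Cosine similarity between rows: normalise each row to unit length,
# then take all pairwise dot products. Assumes no all-zero rows.
X = df.to_numpy(dtype=float)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_unit @ X_unit.T  # dense (3, 3) ndarray


def to_long(sim, ids, keep_self=False):
    """Flatten a square similarity matrix into ID / Pair_ID / Sim_Value rows."""
    # Row and column positions of every cell, flattened in row-major order
    i, j = np.indices(sim.shape).reshape(2, -1)
    if not keep_self:
        # Drop the diagonal, i.e. each row paired with itself
        mask = i != j
        i, j = i[mask], j[mask]
    return pd.DataFrame({
        "ID": ids[i],
        "Pair_ID": ids[j],
        "Sim_Value": sim[i, j],
    })


pairs = to_long(sim, df.index)                            # n * (n - 1) rows
pairs_with_self = to_long(sim, df.index, keep_self=True)  # n * n rows
print(pairs)
```

Passing keep_self=True keeps rows such as (1, 1), which matches the layout requested in the question. Both (1, 2) and (2, 1) appear in either case, because the matrix is symmetric and every off-diagonal cell is emitted.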