I need to calculate the pairwise cosine similarity between the rows of a pandas DataFrame and store the result in another pandas DataFrame.
Currently I compute the similarity with cosine_similarity from sklearn.metrics.pairwise:
sim = cosine_similarity(df,dense_output=False)
A sample from sim:
[[1.00000000 8.33333333 ... 8.72871561 8.72871561 8.72871561]
[8.33333333 1.00000000 ... 7.63762616 7.63762616 7.63762616]]
Now I wish to store it back into a Pandas dataframe with this structure:
ID  Pair_ID  Sim_Value
1   1        1.00
1   2        8.33
...
2   1        8.33
How can I do that?
Build the row and column index arrays first, then construct the DataFrame:
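import numpy as np
import pandas as pd

# Row (i) and column (j) positions for every cell of sim, flattened to 1-D
i, j = np.indices(sim.shape).reshape(2, -1)

# Drop the self-pairs on the diagonal; skip these three lines to keep them
mask = i != j
i = i[mask]
j = j[mask]

pd.DataFrame({
    'ID': df.index[i],
    'Pair_ID': df.index[j],
    'Sim_Value': sim[i, j]
})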
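
For reference, the approach above can be put together as a small self-contained sketch. The function name similarity_to_long, the include_self flag, and the toy data are hypothetical and only for illustration. So that the sketch runs without scikit-learn, the similarity matrix is computed here with plain NumPy (row-normalize, then take dot products); for a dense input with no all-zero rows, cosine_similarity(df) should produce the same matrix.

```python
import numpy as np
import pandas as pd


def similarity_to_long(sim, index, include_self=False):
    """Turn a square similarity matrix into ID / Pair_ID / Sim_Value rows.

    `sim` is an (n, n) NumPy array and `index` holds the n row labels.
    Both the function name and `include_self` are illustrative choices.
    """
    # Row and column positions of every cell, flattened to 1-D arrays
    i, j = np.indices(sim.shape).reshape(2, -1)
    if not include_self:
        # Remove the diagonal, i.e. each row compared with itself
        keep = i != j
        i, j = i[keep], j[keep]
    return pd.DataFrame({
        "ID": index[i],
        "Pair_ID": index[j],
        "Sim_Value": sim[i, j],
    })


# Hypothetical toy data: three rows labelled 1, 2, 3
toy = pd.DataFrame(
    [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
    index=[1, 2, 3],
    columns=["a", "b"],
)

# NumPy stand-in for cosine_similarity(toy): normalize rows, then dot products
values = toy.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
sim = unit @ unit.T

result = similarity_to_long(sim, toy.index)
print(result)
```

Passing include_self=True keeps the diagonal pairs such as (1, 1), which matches the layout in the question where the first row pairs ID 1 with itself.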