python pandas vectorization cosine-similarity

Computing cosine similarity in a vectorized operation


I am trying to compute the cosine similarity between the rows of two 2D arrays.

Let's say I have a dataframe whose shape is (5, 4):

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
df


      ref_x     ref_y     alt_x     alt_y
0  2.523641  1.270625  0.127030  0.680601
1 -0.992681 -0.021022  0.461249  0.183311
2 -0.865873 -0.117191 -1.521882 -0.388608
3 -0.081354 -1.852463 -0.086464  0.249440
4 -0.057760  0.023642  0.002147 -1.009961

I know how to compute the cosine similarity row by row with scipy:

from scipy import spatial
df['sim'] = df.apply(lambda row: 1 - spatial.distance.cosine(row[['ref_x', 'ref_y']], row[['alt_x', 'alt_y']]), axis=1)

But it is slow, and my real dataframe is much larger.

I want to do something like the following, but it raises "ValueError: Input vector should be 1-D.":

df['sim'] = 1 - spatial.distance.cosine(df[['ref_x', 'ref_y']], df[['alt_x', 'alt_y']])

Does anyone have any suggestions or comments?


Solution

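  • use cosine_similarity from sklearn; note that it returns the pairwise similarity between all rows (a 5×5 matrix here), treating each row as one 4-dimensional vector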

    import numpy as np
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    df = pd.DataFrame(np.random.randn(20).reshape(5,4), columns=["ref_x", "ref_y", "alt_x", "alt_y"])
    # Pairwise similarity between every pair of rows (5 x 5 result)
    co_sim = cosine_similarity(df.to_numpy())
    pd.DataFrame(co_sim)
    

    output:

              0         1         2         3         4
    0  1.000000  0.085483 -0.126060 -0.137558 -0.411323
    1  0.085483  1.000000 -0.447271 -0.277837  0.440389
    2 -0.126060 -0.447271  1.000000  0.309562 -0.306372
    3 -0.137558 -0.277837  0.309562  1.000000 -0.811515
    4 -0.411323  0.440389 -0.306372 -0.811515  1.000000
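
The question asks for one similarity per row (the ref vector against the alt vector) rather than a matrix over all row pairs, so a row-wise version can be sketched with plain NumPy. This is only a minimal sketch and not part of the original answer: the helper name rowwise_cosine_similarity is hypothetical, the small hand-picked dataframe stands in for the real data, and returning NaN for zero-length vectors is an assumption about how such rows should be treated.

```python
import numpy as np
import pandas as pd


def rowwise_cosine_similarity(a, b):
    """Cosine similarity between matching rows of two (n, d) arrays.

    Hypothetical helper for illustration; not a scipy or sklearn API.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Row-wise dot product: sum over columns of a[i, j] * b[i, j]
    dot = np.einsum("ij,ij->i", a, b)
    # Product of the Euclidean norms of each pair of rows
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Assumption: a zero-length vector has no direction, so report NaN there
    return np.divide(dot, norms, out=np.full_like(dot, np.nan), where=norms > 0)


# Illustrative rows: orthogonal, parallel, and opposite vector pairs
df = pd.DataFrame(
    [[1.0, 0.0, 0.0, 1.0],
     [1.0, 1.0, 2.0, 2.0],
     [1.0, 0.0, -1.0, 0.0]],
    columns=["ref_x", "ref_y", "alt_x", "alt_y"],
)

df["sim"] = rowwise_cosine_similarity(
    df[["ref_x", "ref_y"]].to_numpy(),
    df[["alt_x", "alt_y"]].to_numpy(),
)
```

Everything here runs as whole-array operations, so there is no Python-level loop over rows as there is with df.apply. sklearn.metrics.pairwise also provides paired_cosine_distances, which works on matching rows of two arrays; 1 minus its result should give comparable values if staying within sklearn is preferred.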