python pandas scikit-learn scipy cosine-similarity

How can I find the cosine similarity between an input array and a pandas DataFrame and return the most similar row?

I have a data set as shown below, and I want to find the cosine similarity between an input array and each row in the dataframe in order to identify the row that is most similar or a duplicate. The data shown below is a sample and has multiple features. I want to compute the cosine similarity between the input row and each row in the data, then pick the best match (argmax of the similarity, or equivalently argmin of the cosine distance).

Solution

There are several ways to compute cosine similarity. Here is a brief summary of how each of them applies to a dataframe.

Data

import pandas as pd
import numpy as np

# Please don't make people do this. You should have enough reps to know that.
np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "col1": np.random.randn(5),
        "col2": np.random.randn(5),
        "col3": np.random.randn(5),
    }
)

input_array = np.array([1,2,3])

# print
df
Out[6]: 
       col1      col2      col3
0 -1.133838 -0.459439  0.238894
1  0.384319 -0.059169 -0.589920
2  1.496554 -0.354174 -1.440585
3 -0.355382 -0.735523  0.773703
4 -0.787534 -1.183940 -1.027967

1. Sklearn cosine_similarity

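Just mind the shapes. 2D data should always be shaped as (#rows, #features), so the 1D input array is reshaped to (1, -1). Also mind the output shape: the result is a (1, #rows) matrix, which is flattened here with reshape(-1).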

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(input_array.reshape((1, -1)), df).reshape(-1)
Out[7]: array([-0.28645981, -0.56882572, -0.44816313,  0.11750604, -0.95037169])
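
The question asks for the most similar row itself, not only the scores. A minimal sketch of that last step, assuming the similarity values printed in Out[7] and a frame rebuilt from the rounded values printed in the Data section (the names sims, best_pos, best_score and most_similar_row are made up for illustration). The highest cosine similarity marks the closest row, so this uses argmax rather than argmin:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the (rounded) values printed in the Data section.
df = pd.DataFrame(
    {
        "col1": [-1.133838, 0.384319, 1.496554, -0.355382, -0.787534],
        "col2": [-0.459439, -0.059169, -0.354174, -0.735523, -1.183940],
        "col3": [0.238894, -0.589920, -1.440585, 0.773703, -1.027967],
    }
)

# Similarities as printed in Out[7], one value per row of df.
sims = np.array([-0.28645981, -0.56882572, -0.44816313, 0.11750604, -0.95037169])

# np.argmax returns a position, so index the frame with iloc (not loc).
best_pos = int(np.argmax(sims))
best_score = float(sims[best_pos])
most_similar_row = df.iloc[best_pos]

print(best_pos, best_score)
print(most_similar_row)
```

With this sample the best match is row 3, the only row with a positive similarity. If the scores are kept as a pandas Series instead (as in the apply-based variants), idxmax() gives the index label directly.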

2. Scipy cosine distance

scipy's cosine returns the cosine distance, not the similarity, so the similarity is 1 - cosine(a1, a2). Apply it to each row (axis=1). The result is the same as with sklearn.

from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input_array), axis=1)
Out[10]: 
0   -0.286460
1   -0.568826
2   -0.448163
3    0.117506
4   -0.950372
dtype: float64

3. Compute manually

Essentially the same as scipy except that you code the formula manually.

from numpy.linalg import norm
df.apply(lambda row: input_array.dot(row) / norm(input_array) / norm(row), axis=1)
Out[8]: 
0   -0.286460
1   -0.568826
2   -0.448163
3    0.117506
4   -0.950372
dtype: float64
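
The same formula can also be written without apply, which is likely to be faster on large frames because the work happens in NumPy rather than in a Python-level loop over rows. A minimal sketch, again assuming the frame is rebuilt from the rounded values printed in the Data section (X, dots, norms and similarity are illustrative names):

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the (rounded) values printed in the Data section.
df = pd.DataFrame(
    {
        "col1": [-1.133838, 0.384319, 1.496554, -0.355382, -0.787534],
        "col2": [-0.459439, -0.059169, -0.354174, -0.735523, -1.183940],
        "col3": [0.238894, -0.589920, -1.440585, 0.773703, -1.027967],
    }
)
input_array = np.array([1, 2, 3])

X = df.to_numpy(dtype=float)                 # shape (#rows, #features)
dots = X @ input_array                       # dot product of every row with the input
norms = np.linalg.norm(X, axis=1) * np.linalg.norm(input_array)

# Keep the original index so the result lines up with df.
similarity = pd.Series(dots / norms, index=df.index)

print(similarity)
print(similarity.idxmax())  # index label of the most similar row
```

Note that a row (or input) of all zeros has norm 0, so this sketch would produce NaN for it. Such rows need separate handling if they can occur in the real data.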

Also refer to the relation between Pearson correlation, cosine similarity and z-score to see whether it is helpful.
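
As a quick illustration of that relation: the Pearson correlation of two vectors equals the cosine similarity of the same vectors after each has been mean-centered. A minimal sketch that checks this numerically, using the input array and row 3 of the sample data (cosine_sim is a helper defined here for illustration, not a library function):

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity of two 1D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


x = np.array([1.0, 2.0, 3.0])                    # the input array
y = np.array([-0.355382, -0.735523, 0.773703])   # row 3 of the sample frame

# Cosine similarity after subtracting each vector's own mean ...
centered_cos = cosine_sim(x - x.mean(), y - y.mean())
# ... agrees with the Pearson correlation coefficient.
pearson = float(np.corrcoef(x, y)[0, 1])

print(centered_cos, pearson)
```

The two printed numbers agree up to floating-point error. Plain cosine similarity (without centering) generally gives a different value, so which one is appropriate depends on whether the offset of each row should matter.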