I have a data set as shown below and I want to find the cosine similarity between an input array and each row in a dataframe in order to identify the row that is most similar to, or a duplicate of, the input.
The data shown below is a sample and has multiple features. I want to compute the cosine similarity between the input row and each row in the data, then pick the best match (argmax of the similarity, or equivalently argmin of the cosine distance).
There are various ways of computing cosine similarity. Here is a brief summary of how each of them applies to a dataframe.
import pandas as pd
import numpy as np
# Please don't make people do this. You should have enough reps to know that.
np.random.seed(111) # reproducibility
df = pd.DataFrame(
    data={
        "col1": np.random.randn(5),
        "col2": np.random.randn(5),
        "col3": np.random.randn(5),
    }
)
input_array = np.array([1,2,3])
# print
df
Out[6]:
       col1      col2      col3
0 -1.133838 -0.459439  0.238894
1  0.384319 -0.059169 -0.589920
2  1.496554 -0.354174 -1.440585
3 -0.355382 -0.735523  0.773703
4 -0.787534 -1.183940 -1.027967
When using sklearn, just mind the correct shape: 2D data should always be shaped as (#rows, #features). Also mind the output shape, which is 2D as well (hence the final reshape(-1) below).
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(input_array.reshape((1, -1)), df).reshape(-1)
Out[7]: array([-0.28645981, -0.56882572, -0.44816313, 0.11750604, -0.95037169])
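To get from these scores to the row the question asks about, take the argmax rather than the argmin: cosine similarity is largest for the closest match. Here is a minimal sketch that reuses the scores printed above. The names `scores`, `best_row`, `worst_row` and `duplicate_rows` are my own for illustration, and treating a score of about 1 as a "duplicate" is an assumption (it only means the row points in the same direction as the input, not that the values are identical).

```python
import numpy as np

# Similarity scores copied from the sklearn output above (one per dataframe row).
scores = np.array([-0.28645981, -0.56882572, -0.44816313, 0.11750604, -0.95037169])

best_row = int(np.argmax(scores))    # position of the most similar row
worst_row = int(np.argmin(scores))   # position of the least similar row

# Assumption: a "duplicate" is a row whose similarity is (numerically) 1,
# i.e. it points in the same direction as the input.
duplicate_rows = np.flatnonzero(np.isclose(scores, 1.0))

print(best_row, scores[best_row])    # 3 0.11750604
print(worst_row, duplicate_rows)     # 4 []
```

These are positions, so with a non-default index you would look up the label with `df.index[best_row]`.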
With scipy, just apply this on each row (axis=1). The result is the same as using sklearn. Note that cosine similarity is 1 - cosine(a1, a2) here, because scipy's cosine returns the cosine distance.
from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input_array), axis=1)
Out[10]:
0   -0.286460
1   -0.568826
2   -0.448163
3    0.117506
4   -0.950372
dtype: float64
With numpy, it is essentially the same as scipy, except that you code the formula manually.
from numpy.linalg import norm
df.apply(lambda row: input_array.dot(row) / norm(input_array) / norm(row), axis=1)
Out[8]:
0   -0.286460
1   -0.568826
2   -0.448163
3    0.117506
4   -0.950372
dtype: float64
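If the dataframe is large, the row-wise apply can be replaced by one matrix-vector product. The following is a sketch under my own assumptions: the helper name `cosine_to_rows` and the small `toy` frame are hypothetical (not from the data above), and rows with zero norm are not handled (they would produce NaN).

```python
import numpy as np
import pandas as pd


def cosine_to_rows(frame: pd.DataFrame, vector) -> pd.Series:
    """Cosine similarity between `vector` and every row of `frame` (hypothetical helper)."""
    values = frame.to_numpy(dtype=float)        # shape (#rows, #features)
    vector = np.asarray(vector, dtype=float)    # shape (#features,)
    dots = values @ vector                      # one dot product per row
    norms = np.linalg.norm(values, axis=1) * np.linalg.norm(vector)
    return pd.Series(dots / norms, index=frame.index)


# Made-up data: row 0 equals the input, row 2 is its negation.
toy = pd.DataFrame(
    {
        "col1": [1.0, 2.0, -1.0],
        "col2": [2.0, 0.0, -2.0],
        "col3": [3.0, 0.0, -3.0],
    }
)
sims = cosine_to_rows(toy, np.array([1, 2, 3]))
most_similar = sims.idxmax()  # index label of the closest row
print(most_similar)           # 0
```

Because the result keeps the dataframe's index, `idxmax()` returns the label of the best match directly.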
Also refer to the relation between Pearson correlation, cosine similarity and z-score to see whether it is helpful.
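As a short illustration of that relation: Pearson correlation is the cosine similarity of the two vectors after each has been mean-centered. The sketch below checks this numerically using the input array and row 0 of the sample data. The helper `cosine_sim` is a name I made up for this example.

```python
import numpy as np


def cosine_sim(a, b) -> float:
    """Plain cosine similarity of two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


x = np.array([1.0, 2.0, 3.0])                   # the input array
y = np.array([-1.133838, -0.459439, 0.238894])  # row 0 of the sample dataframe

raw_cosine = cosine_sim(x, y)                   # matches the first score above
pearson = float(np.corrcoef(x, y)[0, 1])
centered_cosine = cosine_sim(x - x.mean(), y - y.mean())

# Pearson correlation equals the cosine similarity of the centered vectors.
assert np.isclose(pearson, centered_cosine)
print(raw_cosine, pearson)
```

For this row the two measures disagree in sign (the raw cosine is negative while the correlation is positive), so it matters which one you use to define "most similar".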