python pandas scikit-learn scipy cosine-similarity

How can I find cosine similarity between input array and pandas dataframe and return the row in dataframe which is most similar?


I have a data set as shown below, and I want to find the cosine similarity between an input array and each row in the dataframe in order to identify the row which is most similar or a duplicate. The data shown below is a sample and has multiple features. I want to compute the cosine similarity between the input row and each row in the data, then select the closest row, for example with argmin over the cosine distance (equivalently, argmax over the similarity).


Solution

  • There are various ways of computing cosine similarity. Here I give a brief summary of how each of them applies to a dataframe.

    Data

    import pandas as pd
    import numpy as np
    
    # Please don't make people do this. You should have enough reps to know that.
    np.random.seed(111)  # reproducibility
    df = pd.DataFrame(
        data={
            "col1": np.random.randn(5),
            "col2": np.random.randn(5),
            "col3": np.random.randn(5),
        }
    )
    
    input_array = np.array([1,2,3])
    
    # print
    df
    Out[6]: 
           col1      col2      col3
    0 -1.133838 -0.459439  0.238894
    1  0.384319 -0.059169 -0.589920
    2  1.496554 -0.354174 -1.440585
    3 -0.355382 -0.735523  0.773703
    4 -0.787534 -1.183940 -1.027967
    

    1. Sklearn cosine_similarity

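    Just mind the correct shape. 2D data should always be shaped as (#rows, #features), so the 1D input has to be reshaped to (1, #features). Also mind the output shape: the result is 2D with shape (1, #rows), hence the final reshape(-1).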

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity(input_array.reshape((1, -1)), df).reshape(-1)
    Out[7]: array([-0.28645981, -0.56882572, -0.44816313,  0.11750604, -0.95037169])
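
To actually return the most similar row, which is what the question asks for, one option is to take the argmax of these scores. This is only a minimal sketch: the dataframe is rebuilt from the rounded values printed above so that it does not depend on the random seed, and the names sims, best_pos and most_similar_row are my own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample data copied from the printed dataframe above (rounded to 6 decimals)
df = pd.DataFrame({
    "col1": [-1.133838, 0.384319, 1.496554, -0.355382, -0.787534],
    "col2": [-0.459439, -0.059169, -0.354174, -0.735523, -1.183940],
    "col3": [0.238894, -0.589920, -1.440585, 0.773703, -1.027967],
})
input_array = np.array([1, 2, 3])

# One similarity score per row of df, flattened from shape (1, n_rows)
sims = cosine_similarity(input_array.reshape(1, -1), df.to_numpy()).ravel()

# Highest similarity means most similar, so use argmax (not argmin)
best_pos = int(np.argmax(sims))
most_similar_row = df.iloc[best_pos]
best_score = float(sims[best_pos])
```

With the sample data this picks the row at position 3, the only row with a positive similarity to the input.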
    

    2. Scipy cosine distance

    Just apply this on each row (axis=1). The result is the same as using sklearn. Note that scipy's cosine returns the cosine distance, so the cosine similarity is 1 - cosine(a1, a2) here.

    from scipy.spatial.distance import cosine
    df.apply(lambda row: 1 - cosine(row, input_array), axis=1)
    Out[10]: 
    0   -0.286460
    1   -0.568826
    2   -0.448163
    3    0.117506
    4   -0.950372
    dtype: float64
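
If the row-wise apply turns out to be slow on a larger dataframe, scipy's cdist with metric="cosine" should give the same distances in a single call. A hedged sketch, again on the rounded sample values from above; dists, sims and closest_label are names I made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the printed dataframe above (rounded to 6 decimals)
df = pd.DataFrame({
    "col1": [-1.133838, 0.384319, 1.496554, -0.355382, -0.787534],
    "col2": [-0.459439, -0.059169, -0.354174, -0.735523, -1.183940],
    "col3": [0.238894, -0.589920, -1.440585, 0.773703, -1.027967],
})
input_array = np.array([1, 2, 3])

# cdist takes two 2D arrays and returns shape (1, n_rows) here
query = input_array.astype(float).reshape(1, -1)
dists = cdist(query, df.to_numpy(), metric="cosine").ravel()

sims = 1 - dists                         # distance -> similarity
closest_label = df.index[int(np.argmin(dists))]  # smallest distance = most similar
```

Because this works on distances, argmin is the right selector here, and it lands on the same row as argmax over the similarities.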
    

    3. Compute manually

    Essentially the same as scipy except that you code the formula manually.

    from numpy.linalg import norm
    df.apply(lambda row: input_array.dot(row) / norm(input_array) / norm(row), axis=1)
    Out[8]: 
    0   -0.286460
    1   -0.568826
    2   -0.448163
    3    0.117506
    4   -0.950372
    dtype: float64
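
The same formula can also be written without apply by working on the whole array at once. This is a sketch under the assumption that no row is all zeros (a zero norm would give a division by zero and NaN); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the printed dataframe above (rounded to 6 decimals)
df = pd.DataFrame({
    "col1": [-1.133838, 0.384319, 1.496554, -0.355382, -0.787534],
    "col2": [-0.459439, -0.059169, -0.354174, -0.735523, -1.183940],
    "col3": [0.238894, -0.589920, -1.440585, 0.773703, -1.027967],
})
input_array = np.array([1, 2, 3])

X = df.to_numpy()
row_norms = np.linalg.norm(X, axis=1)      # one norm per row
input_norm = np.linalg.norm(input_array)   # scalar

# Dot product of every row with the input, divided by the product of norms
sims = pd.Series(X @ input_array / (row_norms * input_norm), index=df.index)

best_label = sims.idxmax()                 # index label of the most similar row
most_similar_row = df.loc[best_label]
```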
    

    Also refer to the relation between Pearson correlation, cosine similarity and z-score to see whether it is helpful.
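
As far as I understand that relation, the Pearson correlation of two vectors equals the cosine similarity of the same vectors after subtracting each vector's mean. A small check of this, using the input array and one sample row from above; cosine_sim is a helper I wrote for the illustration, not a library function.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1D arrays (assumes neither has zero norm)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([-0.355382, -0.735523, 0.773703])  # row 3 of the sample data

pearson = np.corrcoef(a, b)[0, 1]
centered_cosine = cosine_sim(a - a.mean(), b - b.mean())
plain_cosine = cosine_sim(a, b)

# pearson and centered_cosine agree up to floating point error;
# plain_cosine generally differs unless both vectors already have zero mean
print(pearson, centered_cosine, plain_cosine)
```

So whether the two measures rank rows the same way depends on whether the features are centered first.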