
Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?


I have a very large DataFrame (millions of rows), and each time I receive a 1-row DataFrame with the same columns. For example:

import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,-1], 'c': [-1,0.4,31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))

I would like to calculate the cosine similarity between the input row and every row of df. I am using the following:

from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input), axis=1)

But it's a bit slow. I tried the swifter package, and it seems to run faster. What is the best practice for such a task: keep this approach or switch to another method?


Solution

  • I usually don't do matrix manipulation with a DataFrame but with a numpy.array, so I will first convert both:

    import numpy as np

    df_npy = df.to_numpy()
    input_npy = input.to_numpy()
    

    And since I don't want to use scipy.spatial.distance.cosine, I will take care of the calculation myself, which means first normalizing each vector to unit length:

    df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
    input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
    
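    One caveat worth noting (an addition, not part of the original answer): `np.linalg.norm` returns 0 for an all-zero row, so the division above would produce NaN for such rows. A guarded version, as a sketch:

    ```python
    import numpy as np

    A = np.array([[1.0, 2.0, -1.0],
                  [0.0, 0.0, 0.0]])  # second row is all zeros
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    # Divide only where the norm is nonzero; all-zero rows stay 0
    # instead of becoming NaN.
    A_unit = np.divide(A, norms, out=np.zeros_like(A), where=norms != 0)
    ```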

    And then matrix-multiply them together:

    df_npy @ input_npy.T
    

    which will give you

    array([[0.213],
           [0.524],
           [0.431]])
    

    The reason I don't want to use scipy.spatial.distance.cosine is that it only handles one pair of vectors at a time, whereas the matrix product above computes all the similarities at once.
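    Putting the steps together, here is a minimal end-to-end sketch (renaming `input` to `query`, since `input` shadows a Python builtin):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, -1], 'c': [-1, 0.4, 31]})
    query = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))

    # Convert to plain arrays and L2-normalize each row
    A = df.to_numpy(dtype=float)
    q = query.to_numpy(dtype=float)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)

    # One matrix product gives the cosine similarity of every df row
    # against the query row, shape (len(df), 1)
    sims = A @ q.T
    print(np.round(sims, 3))  # [[0.213] [0.524] [0.431]]
    ```

    For millions of rows this stays a single vectorized BLAS call, which is why it is much faster than applying `scipy.spatial.distance.cosine` row by row.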