python pandas trigonometry

Can I compute cosine similarity between rows using only non-null values?


I want to find the cosine similarity (or Euclidean distance, if that is easier) between one query row and 10 other rows. These rows contain many NaN values, so any column that is NaN should be ignored.

For example, query:

A   B   C   D   E   F
3   2  NaN  5  NaN  4

df =

A   B   C   D   E   F
2   1   3  NaN  4   5
1  NaN  2   4  NaN  3
.   .   .   .   .   .
.   .   .   .   .   .

So I just want the cosine similarity computed over the columns that are non-null in both query and the given row of df. For row 0 in df, for example, A, B, and F are non-null in both query and df.

I then want to print the cosine similarity for each row.

Thanks in advance


Solution

  • The simplest method I can think of is to use sklearn's cosine_similarity.

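    from sklearn.metrics.pairwise import cosine_similarity
    # df1 is the query as a one-row DataFrame with the same columns as df;
    # NaNs are replaced with 0 in both before comparing.
    cosine_similarity(df.fillna(0), df1.fillna(0))
    # array([[0.51378309],
    #        [0.86958199]])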
    

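    The easiest way to "ignore" NaNs is to treat them as zeros when computing similarity: a zero contributes nothing to the dot product. Note that each row's norm is still computed over all of its own non-null values, so the result is not identical to restricting both vectors to the columns they share.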
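
If the similarity should use only the columns that are non-null in both the query and the row, as the question describes, one option is to build a mask per row. The following is a minimal sketch, not a library routine: it assumes the query is held in a pandas Series named `query` with the same column labels as `df`, and `masked_cosine` is a helper name made up for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the two rows that were shown).
query = pd.Series({"A": 3, "B": 2, "C": np.nan, "D": 5, "E": np.nan, "F": 4})
df = pd.DataFrame(
    [[2, 1, 3, np.nan, 4, 5],
     [1, np.nan, 2, 4, np.nan, 3]],
    columns=list("ABCDEF"),
)

def masked_cosine(row, q):
    """Cosine similarity over the columns that are non-null in both row and q."""
    shared = row.notna() & q.notna()           # columns present in both vectors
    a = row[shared].to_numpy(dtype=float)
    b = q[shared].to_numpy(dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # No shared columns (or an all-zero vector) leaves the similarity undefined.
    return np.dot(a, b) / denom if denom else np.nan

# One similarity per row of df, indexed like df.
sims = df.apply(masked_cosine, axis=1, q=query)
print(sims)
```

For the two sample rows this gives roughly 0.949 and 0.971, higher than the zero-filled values because columns missing from either side no longer inflate the norms. Whether that is preferable depends on the use case: rows that share only a few columns with the query can score high on very little evidence.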