Pandas: Cosine similarity between each pair of rows


I have a dataframe:

import pandas as pd
data = [['apple', 'one', 0.0, [0.047668457, -0.04888916]], ['banana', 'two', 0.0 , [0.0287323, -0.037841797] ], ['qiwi', 'three', 0.0, [0.031051636, -0.05227661]],
        ['orange', 'one', 1.0, [0.0020618439, -0.055389404]], ['mango', 'two', 1.0, [0.0030326843, -0.036193848]], ['strawberry', 'three', 1.0, [0.008613586, -0.06561279]]]

df = pd.DataFrame(data, columns=['word', 'group', 'count', 'vec'])

+----------+-----+-----+--------------------+----------+
|      word|group|count|                 vec|     word2|
+----------+-----+-----+--------------------+----------+
|     apple|  one|  0.0|[0.047668457, -0....|     apple|
|    banana|  two|  0.0|[0.0287323, -0.03...|    banana|
|      qiwi|three|  0.0|[0.031051636, -0....|      qiwi|
|    orange|  one|  1.0|[0.0020618439, -0...|    orange|
|     mango|  two|  1.0|[0.0030326843, -0...|     mango|
|strawberry|three|  1.0|[0.008613586, -0....|strawberry|
+----------+-----+-----+--------------------+----------+

I want to create a 6x6 dataframe containing the cosine similarity between every pair of rows. The result should look like this (only the first 2 rows are shown):

    +------+----------+----------+------------------+------------------+------------------+------------------+
    |  word|     apple|    banana|              qiwi|            orange|             mango|        strawberry|
    +------+----------+----------+------------------+------------------+------------------+------------------+
    | apple|       1.0|0.99240247|0.9721006775103194|0.7414623055821596|          0.771779|0.8007656107780402|
    |banana|0.99240247|       1.0|        0.99357443|        0.81838407|        0.84415172|          0.868376|
    +------+----------+----------+------------------+------------------+------------------+------------------+
                                   ...........................

I tried this, but I don't know how to fill in all the None values:

df['word2'] = df['word']
df_piv = df.pivot_table(index=['word'], columns='word2',
                         values='vec', aggfunc='first').reset_index()
# calc cos sim
# df2 = df_piv.set_index('word')
# v = cosine_similarity(df2.values)
# done = pd.DataFrame(v, columns=df2.index.values, index=df2.index).reset_index()

        +----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
        |      word|               apple|              banana|               mango|              orange|                qiwi|          strawberry|
        +----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
        |     apple|[0.047668457, -0....|                null|                null|                null|                null|                null|
        |    banana|                null|[0.0287323, -0.03...|                null|                null|                null|                null|
        |     mango|                null|                null|[0.0030326843, -0...|                null|                null|                null|
        |    orange|                null|                null|                null|[0.0020618439, -0...|                null|                null|
        |      qiwi|                null|                null|                null|                null|[0.031051636, -0....|                null|
        |strawberry|                null|                null|                null|                null|                null|[0.008613586, -0....|
        +----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

Solution

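  • You can use the cosine_similarity function from sklearn.metrics.pairwise. No pivot is needed: pass the list of vectors directly and label the result with the words: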

    from sklearn.metrics.pairwise import cosine_similarity
    
    vectors = df['vec'].to_list()
    pd.DataFrame(cosine_similarity(vectors, vectors), 
                 index=df['word'], columns=df['word'])
    

    Output would be:

    word           apple    banana      qiwi    orange     mango  strawberry
    word
    apple       1.000000  0.992402  0.972101  0.741462  0.771779    0.800766
    banana      0.992402  1.000000  0.993574  0.818384  0.844152    0.868376
    qiwi        0.972101  0.993574  1.000000  0.878167  0.899404    0.918923
    orange      0.741462  0.818384  0.878167  1.000000  0.998924    0.995648
    mango       0.771779  0.844152  0.899404  0.998924  1.000000    0.998899
    strawberry  0.800766  0.868376  0.918923  0.995648  0.998899    1.000000
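
If scikit-learn is not available, the same matrix can probably be computed with NumPy alone. This is a minimal sketch rather than part of the original answer: it assumes every list in the vec column has the same length and that no vector is all zeros. The names mat, unit, sim, result and result_flat are made up for illustration.

```python
import numpy as np
import pandas as pd

data = [['apple', 'one', 0.0, [0.047668457, -0.04888916]],
        ['banana', 'two', 0.0, [0.0287323, -0.037841797]],
        ['qiwi', 'three', 0.0, [0.031051636, -0.05227661]],
        ['orange', 'one', 1.0, [0.0020618439, -0.055389404]],
        ['mango', 'two', 1.0, [0.0030326843, -0.036193848]],
        ['strawberry', 'three', 1.0, [0.008613586, -0.06561279]]]
df = pd.DataFrame(data, columns=['word', 'group', 'count', 'vec'])

# Stack the per-row lists into one (n_rows, n_dims) array.
mat = np.array(df['vec'].to_list(), dtype=float)

# Scale each row to unit length (assumes no zero vectors, which would divide by 0).
unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)

# The dot product of two unit vectors is their cosine similarity,
# so one matrix product gives all pairwise values at once.
sim = unit @ unit.T

words = df['word'].to_numpy()
result = pd.DataFrame(sim, index=words, columns=words)

# Optional: expose the index as a regular 'word' column,
# matching the layout shown in the question.
result_flat = result.rename_axis('word').reset_index()
print(result_flat)
```

Here result keeps the words as the index, which makes lookups such as result.loc['apple', 'banana'] convenient, while result_flat has the word column layout the question asked for.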