Using the cosine_similarity function in Python


import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[3,4],[2,5],[1,2],[1,2],[4,5]])

ap = pd.DataFrame(a, index=['Sonata','Etudes','Waltzes','Nocturnes','Marches'],columns=['search_history','view_count'])
ap

b = np.array([[4,4],[3,5],[2,1],[4,7],[1,2]])
bp = pd.DataFrame(b, index=['Sonata','Etudes','Waltzes','Nocturnes','Marches'],columns=['comment + wishlist','signup'])
bp

Then I apply the cosine_similarity function:

from sklearn.metrics.pairwise import cosine_similarity
pd.DataFrame(cosine_similarity(a, b),columns=['A','B'], index=['Sonata','Etudes','Waltzes','Nocturnes','Marches'])

This gives:

ValueError: Shape of passed values is (5, 5), indices imply (5, 2)

So I changed it to pass five column names:

from sklearn.metrics.pairwise import cosine_similarity
pd.DataFrame(cosine_similarity(a, b),columns=['A','B','c','d','e'], index=['Sonata','Etudes','Waltzes','Nocturnes','Marches'])

This works, but the result is not what I expected. Like the inputs a and b, I want the result to have five rows and two columns, but I always get five rows and five columns.

What should I do?

The expected result was:

           A         B
Sonata     0.989949  0.994692
Etudes     0.919145  0.987241
Waltzes    0.948683  0.997054
Nocturnes  0.948683  0.997054
Marches    0.993884  0.990992


Solution

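  • cosine_similarity(a, b) compares every row of the first array with every row of the second array, so two 5-row inputs give a 5 × 5 matrix (25 values). That is why the DataFrame needs five column names. The values you expect are the first two columns of that matrix, i.e. the similarity of each row of a to the first two rows of b, so you can slice the resulting DataFrame: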

    df = pd.DataFrame(cosine_similarity(a, b), columns=['A', 'B', 'C', 'D', 'E'], index=['Sonata', 'Etudes', 'Waltzes', 'Nocturnes', 'Marches'])
    print(df[['A', 'B']]) # select by column names
    # or
    print(df.iloc[:, 0:2]) # select by column positions
    

    Output

                      A         B
    Sonata     0.989949  0.994692
    Etudes     0.919145  0.987241
    Waltzes    0.948683  0.997054
    Nocturnes  0.948683  0.997054
    Marches    0.993884  0.990992
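
    To see where the 5 × 5 shape comes from, here is a minimal NumPy-only sketch of a pairwise cosine similarity. The helper name pairwise_cosine is made up for illustration and is not part of scikit-learn. It assumes no row is all zeros (that would divide by zero). It also shows that comparing a against only the first two rows of b yields the 5 × 2 result directly, without slicing afterwards.

```python
import numpy as np

def pairwise_cosine(x, y):
    """Cosine similarity between every row of x and every row of y.

    Hypothetical helper for illustration; assumes no all-zero rows.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Scale each row to unit length so the dot product equals the cosine.
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # (n_x, d) @ (d, n_y) -> (n_x, n_y): one value per pair of rows.
    return x_unit @ y_unit.T

a = np.array([[3, 4], [2, 5], [1, 2], [1, 2], [4, 5]])
b = np.array([[4, 4], [3, 5], [2, 1], [4, 7], [1, 2]])

full = pairwise_cosine(a, b)           # all 5 rows of a vs. all 5 rows of b
first_two = pairwise_cosine(a, b[:2])  # all 5 rows of a vs. first 2 rows of b

print(full.shape)                           # (5, 5)
print(first_two.shape)                      # (5, 2)
print(np.allclose(full[:, :2], first_two))  # True
```

    With scikit-learn, the same restriction should be possible by passing b[:2] as the second argument, since the two inputs only need the same number of columns, not the same number of rows.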