Cosine similarity of rows in pandas DataFrame


I calculated the cosine similarity of a dataframe similar to the following:

ciiu4n4  A0111  A0112  A0113   
 A0111      14      7      6 
 A0112      16     55      3 
 A0113      15      0    112 

using this code:

data_cosine = mpg_data.drop(['ciiu4n4'], axis=1)
result = cosine_similarity(data_cosine)

I get as a result an array like this:

[[ 1.          0.95357118  0.95814892 ]
 [ 0.95357118  1.          0.89993795 ]
 [ 0.95814892  0.89993795  1.         ]]

However, I need the result as a dataframe similar to the original one. I can't do it manually, because the original dataframe is 600 x 600.

The result I need should look something like this:

ciiu4n4   A0111        A0112        A0113       
 A0111    1.           0.95357118   0.95814892
 A0112    0.95357118   1.           0.89993795
 A0113    0.95814892   0.89993795   1.  

Solution

  • I'd recommend changing your approach slightly. No need to drop any columns. Instead, set the first column as the index, compute cosine similarities, and assign the result array back to the dataframe.

    df = df.set_index('ciiu4n4')
    df
    
             A0111  A0112  A0113
    ciiu4n4                     
    A0111       14      7      6
    A0112       16     55      3
    A0113       15      0    112
    

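    # assumes scikit-learn's implementation, which the question appears to use
    from sklearn.metrics.pairwise import cosine_similarity

    v = cosine_similarity(df.values)   # (n_rows, n_rows) array
    
    df[:] = v                          # overwrite the values, keep the labels
    df.reset_index()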
    
      ciiu4n4     A0111     A0112     A0113
    0   A0111  1.000000  0.953571  0.958149
    1   A0112  0.953571  1.000000  0.899938
    2   A0113  0.958149  0.899938  1.000000
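
    For reference, here is a minimal sketch of what a row-wise cosine similarity computes, written with plain NumPy so it runs without scikit-learn. The helper name row_cosine_similarity is hypothetical and is only a stand-in for cosine_similarity; it does not guard against all-zero rows. Run on the three-row sample, it will not reproduce the figures quoted in the question, which presumably come from the full 600 x 600 data.

```python
import numpy as np
import pandas as pd


def row_cosine_similarity(x):
    """Hypothetical NumPy stand-in for cosine_similarity(x) on the rows of x."""
    x = np.asarray(x, dtype=float)
    # Scale every row to unit length; the similarity of two rows is then
    # just the dot product of their unit vectors.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


# The three-row sample from the question, indexed by ciiu4n4.
df = pd.DataFrame(
    {"A0111": [14, 16, 15], "A0112": [7, 55, 0], "A0113": [6, 3, 112]},
    index=pd.Index(["A0111", "A0112", "A0113"], name="ciiu4n4"),
)

v = row_cosine_similarity(df.values)
print(v.shape)      # (3, 3): one row and one column per input row
print(np.diag(v))   # each row compared with itself gives 1.0, up to rounding
```

    The output is always square in the number of rows, which is why it can be written back into a frame that has as many value columns as rows.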
    

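    The df[:] = v assignment above only works when the frame is square, i.e. when the number of rows equals the number of columns (excluding the first), because the similarity matrix always has one row and one column per input row. So, here's another solution that should generalise to any shape.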

    df = df.set_index('ciiu4n4')
    v = cosine_similarity(df.values)
    
    df = pd.DataFrame(v, columns=df.index.values, index=df.index).reset_index()
    df
    
      ciiu4n4     A0111     A0112     A0113
    0   A0111  1.000000  0.953571  0.958149
    1   A0112  0.953571  1.000000  0.899938
    2   A0113  0.958149  0.899938  1.000000
    

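    Or, using df.insert in place of the pd.DataFrame(...).reset_index() line (df must still be indexed by ciiu4n4 at this point) -

    labels = df.index.values  # save the labels before df is overwritten
    df = pd.DataFrame(v, columns=labels)
    df.insert(0, 'ciiu4n4', labels)
    df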
    
      ciiu4n4     A0111     A0112     A0113
    0   A0111  1.000000  0.953571  0.958149
    1   A0112  0.953571  1.000000  0.899938
    2   A0113  0.958149  0.899938  1.000000
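
    To check that the generalised approach really is independent of the frame's shape, here is a small sketch on a non-square frame. The data and the column names x and y are hypothetical, and the similarity matrix is computed with plain NumPy as a stand-in for cosine_similarity so the example runs without scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three labelled rows but only two feature columns.
raw = pd.DataFrame(
    {"ciiu4n4": ["A0111", "A0112", "A0113"], "x": [14, 16, 15], "y": [7, 55, 0]}
)
df = raw.set_index("ciiu4n4")

# NumPy stand-in for cosine_similarity(df.values):
# unit-length rows, then pairwise dot products.
x = df.values.astype(float)
unit = x / np.linalg.norm(x, axis=1, keepdims=True)
v = unit @ unit.T                       # shape (3, 3), while df is (3, 2)

# Variant 1: label both axes with the row labels, then restore the key column.
out_a = pd.DataFrame(v, columns=df.index.values, index=df.index).reset_index()

# Variant 2: build the frame first, then insert the saved labels as column 0.
labels = df.index.values
out_b = pd.DataFrame(v, columns=labels)
out_b.insert(0, "ciiu4n4", labels)

print(out_a.columns.tolist())   # ['ciiu4n4', 'A0111', 'A0112', 'A0113']
print(out_b)
```

    Both variants label the result with the row labels rather than the original column names, so the number of feature columns does not matter.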