python, python-3.x, cosine-similarity

Compare a list with the rows of a pandas DataFrame using cosine similarity and get the rank


I have a pandas DataFrame and a user input. I need to compare the user input with each row of the DataFrame and get a ranked list of its rows based on cosine similarity.

Department  Country Age Grade   Score
Math    India   Young   A   97
Math    India   Young   B   86
Math    India   Young   D   68
Science India   Young   A   92
Science India   Young   B   81
Science India   Young   C   76
Social  India   Young   B   88
Social  India   Young   D   62
Social  India   Young   C   72

User input:

Country Age Grade   Score
India   Young   B   84
India   Young   D   65
India   Young   A   98

I would prefer to treat each row of the DataFrame as a list and the user input as a list as well, say User_list1 = ['India', 'Young', 'B', '84'], compare it with each row of the DataFrame using cosine similarity, and get the ranked output of Department.

In my case, the output would be the ranked list of departments, Out = ['Math', 'Science', 'Social'], based on the cosine similarity results.


Solution

  • Taking the two DataFrames as given above,

    df
      Department Country    Age Grade  Score
    0       Math   India  Young     A     97
    1       Math   India  Young     B     86
    2       Math   India  Young     D     68
    3    Science   India  Young     A     92
    4    Science   India  Young     B     81
    5    Science   India  Young     C     76
    6     Social   India  Young     B     88
    7     Social   India  Young     D     62
    8     Social   India  Young     C     72
    
    input
    
      Country    Age Grade  Score
    0   India  Young     B     84
    1   India  Young     D     65
    2   India  Young     A     98
    

    One possible solution starts with the following imports:

    from collections import OrderedDict

    import numpy as np
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    

    Convert the categorical features to numeric codes using scikit-learn's LabelEncoder:

    df['Country'] = le.fit_transform(df['Country'])
    df['Age'] = le.fit_transform(df['Age'])
    df['Grade'] = le.fit_transform(df['Grade'])
    df
    

    Output:

      Department  Country  Age  Grade  Score
    0       Math        0    0      0     97
    1       Math        0    0      1     86
    2       Math        0    0      3     68
    3    Science        0    0      0     92
    4    Science        0    0      1     81
    5    Science        0    0      2     76
    6     Social        0    0      1     88
    7     Social        0    0      3     62
    8     Social        0    0      2     72
    
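    Encode the user input in the same way:

    # fit_transform refits the encoder on the input alone, so the codes are
    # specific to the input (Grade D becomes 2 here, but it is 3 in df).
    input['Country'] = le.fit_transform(input['Country'])
    input['Age'] = le.fit_transform(input['Age'])
    input['Grade'] = le.fit_transform(input['Grade'])
    input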
    

    Output:

     Country  Age   Grade  Score
    0   0       0     1     84
    1   0       0     2     65
    2   0       0     0     98
    

    Define a cosine-similarity function:

    def cosine_similarity(a, b):
        # Dot product of a and b divided by the product of their Euclidean norms.
        nom = np.sum(np.multiply(a, b))
        denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
        sim = nom / denom
        return sim
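As a quick sanity check, the function can be tried on a few small vectors. The sketch below repeats the definition so that it runs on its own. The first two pairs are made-up test vectors; the third pair is the encoded first input row against row 7 of `df`, and it suggests that the large Score values dominate the small label codes, so similarities between these rows all sit very close to 1.

```python
import numpy as np

def cosine_similarity(a, b):
    nom = np.sum(np.multiply(a, b))
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    return nom / denom

# Made-up vectors: same direction should give 1, perpendicular should give 0.
parallel = cosine_similarity(np.array([0, 0, 1, 84]), np.array([0, 0, 2, 168]))
orthogonal = cosine_similarity(np.array([1, 0]), np.array([0, 1]))

# Encoded input row (B, 84) against df row 7 (D, 62): different grade and score,
# yet the similarity is still above 0.999.
close_but_different = cosine_similarity(np.array([0, 0, 1, 84]), np.array([0, 0, 3, 62]))

print(parallel, orthogonal, close_but_different)
```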
    
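    For each input row, compute its similarity to every row of df, take the best similarity within each department, and keep the department with the highest value:

    # Unique department names in order of first appearance.
    dept = list(OrderedDict.fromkeys(df['Department']))
    results = []
    for i in range(len(input)):
        # Similarity of input row i to every row of df (Department column excluded).
        similarity = []
        for j in range(len(df)):
            a = input.iloc[i]
            b = df.iloc[j, 1:]
            c_sim = cosine_similarity(a, b)
            similarity.append(c_sim)

        # Best similarity per department; assumes df is grouped by department
        # with exactly 3 rows each.
        max_similarity = []
        for k in range(0, len(df), 3):
            max_3 = max(similarity[k:k+3])
            max_similarity.append(max_3)

        # Department whose best row is most similar to input row i.
        max_idx = max_similarity.index(max(max_similarity))
        results.append(dept[max_idx])
    results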
    

    Output (the best-matching department for each of the three input rows):

    ['Math', 'Social', 'Math']
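The loop above returns only the top department per input row, while the question asks for a ranked list. A minimal sketch of that ranking for a single input row follows, assuming the encoded values printed earlier for `df` and the first user row (India, Young, B, 84). It replaces the fixed stride of 3 with a pandas `groupby`, so departments may have any number of rows. The names `user_row`, `features`, `sims`, `best` and `ranking` are illustrative.

```python
import numpy as np
import pandas as pd

# Encoded df as printed above (Grade: A=0, B=1, C=2, D=3).
df = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": [0] * 9,
    "Age": [0] * 9,
    "Grade": [0, 1, 3, 0, 1, 2, 1, 3, 2],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})
# Encoded first user row: India, Young, B, 84.
user_row = np.array([0, 0, 1, 84], dtype=float)

features = df[["Country", "Age", "Grade", "Score"]].to_numpy(dtype=float)
# Cosine similarity of the user row against every row of df in one step.
sims = features @ user_row / (np.linalg.norm(features, axis=1) * np.linalg.norm(user_row))

# Best similarity within each department, sorted from most to least similar.
best = pd.Series(sims, index=df["Department"]).groupby(level=0).max()
ranking = best.sort_values(ascending=False).index.tolist()
print(ranking)  # ['Math', 'Science', 'Social']
```

Running the same steps for each input row gives one ranked list per row. Taking the maximum per department is one possible aggregation; the mean is another, and the choice depends on what "closest department" should mean.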