Tags: python-3.x, pandas, dataframe, cosine-similarity

Cosine similarity between rows of a pandas DataFrame


I have a CSV file with the content shown below, and I want to calculate the cosine similarity between each ID and the remaining IDs in the file.

I loaded it into a pandas DataFrame and computed the similarities as follows:

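    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Convert each Vector cell into a flat NumPy array
    old_df['Vector'] = old_df.apply(
        lambda row: np.array(np.matrix(row.Vector)).ravel(), axis=1)
    # Stack the per-row arrays into one (n_rows, n_dims) matrix
    A = np.array(old_df['Vector'].tolist())
    similarities = cosine_similarity(A)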

The output looks fine. However, I do not know how to find which GUIDs (or IDs) are similar to a given GUID (or ID), and I only want the top k with the highest similarity scores.

Could you please help me solve this issue?

Thank you.

|Index  |  GUID | Vector                                |
|-------|-------|---------------------------------------|
|36099  | b770  |[-0.04870541 -0.02133574  0.03180726]  |
|36098  | 808f  |[  0.0732905  -0.05331331  0.06378368] |
|36097  | b111  |[ 0.01994788  0.00417582 -0.09615131]  |
|36096  | b6b5  |[0.025697   -0.08277534 -0.0124591]    |
|36083  | 9b07  |[ 0.025697   -0.08277534 -0.0124591]   |
|36082  | b9ed  |[-0.00952298  0.06188576 -0.02636449]  |
|36081  | a5b6  |[0.00432161  0.02264584 -0.0341924]    |
|36080  | 9891  |[ 0.08732156  0.00649456 -0.02014138]  |
|36079  | ba40  |[0.05407356 -0.09085554 -0.07671648]   |
|36078  | 9dff  |[-0.09859556  0.04498474 -0.01839088]  |
|36077  | a423  |[-0.06124249  0.06774347 -0.05234318]  |
|36076  | 81c4  |[0.07278682 -0.10460124 -0.06572364]   |
|36075  | 9f88  |[0.09830415  0.05489364 -0.03916228]   |
|36074  | adb8  |[0.03149953 -0.00486591  0.01380711]   |
|36073  | 9765  |[0.00673934  0.0513557  -0.09584251]   |
|36072  | aff4  |[-0.00097896  0.0022945   0.01643319]  |

Solution

  • Example code to get the top k cosine similarities and their corresponding GUIDs and row indexes:

    import numpy as np
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity
    
    data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
    df = pd.DataFrame(data)
    print("Data: \n{}\n".format(df))
    
    vectors = []
    for v in df['Vector']:
        vectors.append(v)
    vectors_num = len(vectors)
    A=np.array(vectors)
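    # Get similarities matrix
    similarities = cosine_similarity(A)
    # Mask the diagonal (self-similarity) and the lower triangle (duplicate pairs)
    # with -2, which is below the minimum possible cosine similarity of -1
    similarities[np.tril_indices(vectors_num)] = -2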
    print("Similarities: \n{}\n".format(similarities))
    
    k = 2
    # Only n*(n-1)/2 distinct pairs remain unmasked, so cap k at that count
    k = min(k, vectors_num * (vectors_num - 1) // 2)
    # Get top k similarities and pair GUID in ascending order
    top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
    top_k_similarities = similarities[top_k_indexes]
    top_k_pair_GUID = []
    # top_k_indexes is a (row_indexes, col_indexes) tuple, so zip it into (i, j) pairs
    for i, j in zip(*top_k_indexes):
        pair_GUID = (df.iloc[i]["GUID"], df.iloc[j]["GUID"])
        top_k_pair_GUID.append(pair_GUID)
    
    print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))
    

    Outputs:

    Data:
       GUID             Vector
    0  b770  [-0.1, -0.2, 0.3]
    1  808f  [0.1, -0.2, -0.3]
    2  b111  [-0.1, 0.2, -0.3]
    
    Similarities:
    [[-2.         -0.42857143 -0.85714286] 
     [-2.         -2.          0.28571429] 
     [-2.         -2.         -2.        ]]
    
    top_k_indexes:
    (array([0, 1], dtype=int64), array([1, 2], dtype=int64))
    top_k_pair_GUID:
    [('b770', '808f'), ('808f', 'b111')]
    top_k_similarities:
    [-0.42857143  0.28571429]
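
  • The example above returns the top k pairs across the whole table. The question can also be read as asking, for each GUID, which k other GUIDs are most similar to it. A minimal sketch under that assumption follows; the helper name `top_k_neighbors` is hypothetical, and it reuses the same toy data rather than the real CSV:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "GUID": ["b770", "808f", "b111"],
    "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]],
})

def top_k_neighbors(df, k):
    """Hypothetical helper: map each GUID to its k most similar other GUIDs."""
    A = np.array(df["Vector"].tolist())
    sim = cosine_similarity(A)
    # Exclude self-matches so a row never reports itself as a neighbor
    np.fill_diagonal(sim, -np.inf)
    # Each row has at most n - 1 other rows to compare against
    k = min(k, len(df) - 1)
    # Per row: column indexes sorted by descending similarity, truncated to k
    order = np.argsort(-sim, axis=1)[:, :k]
    guids = df["GUID"].to_numpy()
    return {
        guids[i]: [(guids[j], float(sim[i, j])) for j in order[i]]
        for i in range(len(df))
    }

neighbors = top_k_neighbors(df, k=1)
for guid, matches in neighbors.items():
    print(guid, matches)
```

    This builds the full n x n similarity matrix, which is fine for a few thousand rows but may need chunking for much larger tables.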