Creating a similarity matrix with jagged arrays


I have a DataFrame like this:

id      action               enc
Cell 1  run,swim,walk        1,2,3
Cell 2  swim,climb,surf,gym  2,4,5,6
Cell 3  jog,run              7,1

The table goes on for roughly 30k rows. I gathered all the actions and encoded them with LabelEncoder to get the enc column.

I want to create a similarity matrix that I can use to cluster cells with similar actions together.

I tried pairwise_distances(df['enc'], metric='jaccard'), but got a "setting an array element with a sequence" error. Padding the lists doesn't make much sense to me either.

Is there any way to generate a similarity matrix based on Jaccard similarity? Thanks.


Solution

  • Step 1, build your DataFrame with one list per cell

    import pandas as pd
    data = [['Cell 1', ['run','swim','walk'], [1,2,3]], ['Cell 2', ['swim','climb','surf','gym'], [2,4,5,6]], ['Cell 3', ['jog','run'], [7,1]]]
    
    df = pd.DataFrame(data, columns=['id', 'action', 'label_encoder'])
    print(df)
    
    

    or, if the columns hold comma-separated strings:

    import pandas as pd
    data = [['Cell 1', 'run,swim,walk', '1,2,3'], ['Cell 2', 'swim,climb,surf,gym', '2,4,5,6'], ['Cell 3', 'jog,run', '7,1']]
    df = pd.DataFrame(data, columns=['id', 'action', 'label_encoder'])
    df['action'] = df['action'].str.split(',')
    df['label_encoder'] = df['label_encoder'].str.split(',')
    print(df)
    
    
    
           id                    action label_encoder
    0  Cell 1         [run, swim, walk]     [1, 2, 3]
    1  Cell 2  [swim, climb, surf, gym]  [2, 4, 5, 6]
    2  Cell 3                [jog, run]        [7, 1]
    

    Step 2, add the one-hot vectors as a new column

    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()
    
    one_hot = mlb.fit_transform(df['label_encoder'])
    
    # add one_hot list as a new column
    df['label_encoder_one_hot'] = list(one_hot)
    print(df)
    
    
           id                    action  label_encoder  label_encoder_one_hot
    0  Cell 1         [run, swim, walk]      [1, 2, 3]  [1, 1, 1, 0, 0, 0, 0]
    1  Cell 2  [swim, climb, surf, gym]   [2, 4, 5, 6]  [0, 1, 0, 1, 1, 1, 0]
    2  Cell 3                [jog, run]         [7, 1]  [1, 0, 0, 0, 0, 0, 1]
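
To see which action each one-hot column stands for, the fitted binarizer can be inspected. The following is a minimal sketch, assuming the string labels produced by the str.split variant of Step 1; the variable names encoded, mlb_sparse and one_hot_sparse are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Encoded actions per cell, as strings (what str.split(',') produces).
encoded = [["1", "2", "3"], ["2", "4", "5", "6"], ["7", "1"]]

mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform(encoded)

# classes_ lists the label behind each column, in column order.
print(mlb.classes_)
print(one_hot.shape)

# inverse_transform maps the indicator rows back to tuples of labels.
print(mlb.inverse_transform(one_hot))

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb_sparse = MultiLabelBinarizer(sparse_output=True)
one_hot_sparse = mlb_sparse.fit_transform(encoded)
print(one_hot_sparse.shape)
```

If the number of distinct actions is large, the sparse form may be worth considering, since most entries of the indicator matrix are zero.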
    
    

    Step 3, generate the similarity matrix based on Jaccard

    from sklearn.metrics import jaccard_score
    import numpy as np
    
    # start from the identity matrix: every cell has similarity 1 with itself
    similarity_matrix = np.eye(len(df))
    
    # the matrix is symmetric, so compute each pair once and mirror it
    for i in range(len(df)):
        for j in range(i+1, len(df)):
            similarity = jaccard_score(df['label_encoder_one_hot'][i], df['label_encoder_one_hot'][j])
            similarity_matrix[i,j] = similarity
            similarity_matrix[j,i] = similarity
    
    print(similarity_matrix)
    
    [[1.         0.16666667 0.25      ]
     [0.16666667 1.         0.        ]
     [0.25       0.         1.        ]]
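
The double loop above makes one jaccard_score call per pair, which may become slow at roughly 30k rows. As a possible alternative, pairwise_distances (the function tried in the question) accepts the one-hot matrix directly once the jagged lists have been binarized. This is a sketch, assuming the labels are the integer lists from Step 1; the names encoded, X and distance_matrix are illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import MultiLabelBinarizer

# Encoded actions per cell (jagged lists of different lengths).
encoded = [[1, 2, 3], [2, 4, 5, 6], [7, 1]]

# Rows = cells, columns = distinct actions. The Jaccard metric works on
# boolean data, so cast the 0/1 indicator matrix to bool.
X = MultiLabelBinarizer().fit_transform(encoded).astype(bool)

# pairwise_distances returns Jaccard *distances*; similarity = 1 - distance.
distance_matrix = pairwise_distances(X, metric="jaccard")
similarity_matrix = 1.0 - distance_matrix

print(np.round(similarity_matrix, 4))
```

Here the diagonal is 1, since every cell is identical to itself. A dense 30,000 x 30,000 float64 matrix takes about 7.2 GB, so it may be worth checking available memory first, or passing distance_matrix straight to a clustering algorithm that accepts precomputed distances.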