Creating a similarity matrix with jagged arrays

I have a DataFrame like this:

id	action	enc
Cell 1	run,swim,walk	1,2,3
Cell 2	swim,climb,surf,gym	2,4,5,6
Cell 3	jog,run	7,1

The table has roughly 30k rows. I gathered all the actions and encoded them with LabelEncoder.

I want to create a similarity matrix that I can use to cluster cells with similar actions together.

I tried pairwise_distances(df['enc'], metric='jaccard') but got a "setting an array element" error. Padding the arrays doesn't make much sense to me either.

Is there any way to generate a similarity matrix based on Jaccard? Thanks.

Solution

Step 1, build your DataFrame with list-valued columns

import pandas as pd
data = [['Cell 1', ['run','swim','walk'], [1,2,3]], ['Cell 2', ['swim','climb','surf','gym'], [2,4,5,6]], ['Cell 3', ['jog','run'], [7,1]]]

df = pd.DataFrame(data, columns=['id', 'action', 'label_encoder'])
print(df)

Or, if the columns hold comma-separated strings as in the question, split them into lists:

import pandas as pd
data = [['Cell 1', 'run,swim,walk', '1,2,3'], ['Cell 2', 'swim,climb,surf,gym', '2,4,5,6'], ['Cell 3', 'jog,run', '7,1']]
df = pd.DataFrame(data, columns=['id', 'action', 'label_encoder'])
df['action'] = df['action'].str.split(',')
df['label_encoder'] = df['label_encoder'].str.split(',')
print(df)

       id                    action  label_encoder
0  Cell 1         [run, swim, walk]     [1, 2, 3]
1  Cell 2  [swim, climb, surf, gym]  [2, 4, 5, 6]
2  Cell 3                [jog, run]        [7, 1]

Step 2, add the one-hot encoding as a new column

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

one_hot = mlb.fit_transform(df['label_encoder'])

# add one_hot list as a new column
df['label_encoder_one_hot'] = list(one_hot)
print(df)


       id                    action  label_encoder  label_encoder_one_hot
0  Cell 1         [run, swim, walk]     [1, 2, 3]  [1, 1, 1, 0, 0, 0, 0]
1  Cell 2  [swim, climb, surf, gym]  [2, 4, 5, 6]  [0, 1, 0, 1, 1, 1, 0]
2  Cell 3                [jog, run]        [7, 1]  [1, 0, 0, 0, 0, 0, 1]
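
Each one-hot row is a fixed-length encoding of that cell's set of labels, which is why no padding is needed. As a sanity check, here is a minimal sketch of Jaccard similarity computed directly on the raw label sets. The helper name `jaccard` and the `cells` dict are illustrative only, not part of pandas or scikit-learn, and returning 0.0 for two empty sets is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two label collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: defined as 0.0 here (an assumption, pick what suits your data)
        return 0.0
    return len(a & b) / len(union)

# Encoded labels from the example table
cells = {
    "Cell 1": [1, 2, 3],
    "Cell 2": [2, 4, 5, 6],
    "Cell 3": [7, 1],
}

# Cell 1 and Cell 2 share one label (2) out of six distinct labels, so 1/6
print(jaccard(cells["Cell 1"], cells["Cell 2"]))
# Cell 1 and Cell 3 share one label (1) out of four distinct labels, so 1/4
print(jaccard(cells["Cell 1"], cells["Cell 3"]))
```

The values should agree with the off-diagonal entries produced from the one-hot rows in Step 3.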

Step 3, generate the similarity matrix based on Jaccard

from sklearn.metrics import jaccard_score
import numpy as np

# Start from the identity matrix: every cell has Jaccard similarity 1 with itself
similarity_matrix = np.eye(len(df))

for i in range(len(df)):
    for j in range(i+1, len(df)):
        # Jaccard similarity between the one-hot rows of cells i and j
        similarity = jaccard_score(df['label_encoder_one_hot'][i], df['label_encoder_one_hot'][j])
        similarity_matrix[i,j] = similarity
        similarity_matrix[j,i] = similarity

print(similarity_matrix)

[[1.         0.16666667 0.25      ]
 [0.16666667 1.         0.        ]
 [0.25       0.         1.        ]]
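
With roughly 30k rows, the Python double loop makes on the order of 450 million calls to jaccard_score, which is likely to be slow. Below is a sketch of a vectorized alternative that uses only NumPy: the intersection counts come from a matrix product of the one-hot rows, and the union counts follow from the row sums. The function name `jaccard_similarity_matrix` is hypothetical, and treating two all-zero rows as similarity 0 is an assumption.

```python
import numpy as np

# One-hot rows from Step 2 (one column per distinct label)
one_hot = np.array([
    [1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
])

def jaccard_similarity_matrix(X):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    X = np.asarray(X, dtype=np.int64)
    # intersection[i, j] = number of labels shared by rows i and j
    intersection = X @ X.T
    row_sums = X.sum(axis=1)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Where union is 0 (both rows empty) leave the similarity at 0 (an assumption)
    out = np.zeros(intersection.shape, dtype=float)
    return np.divide(intersection, union, out=out, where=union > 0)

vectorized = jaccard_similarity_matrix(one_hot)
print(vectorized)
```

The off-diagonal values match those from the loop, and the diagonal is 1 because each row is compared with itself. If a clustering routine expects distances rather than similarities, 1 - vectorized gives the Jaccard distance. Note that a dense 30k x 30k float64 matrix takes about 7.2 GB, so memory may be the real constraint at that size.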