I have a DataFrame like this:
id | action | enc |
---|---|---|
Cell 1 | run,swim,walk | 1,2,3 |
Cell 2 | swim,climb,surf,gym | 2,4,5,6 |
Cell 3 | jog,run | 7,1 |
The table goes on for roughly 30k rows. After gathering all these actions and encoding them with LabelEncoder,
I want to create a similarity matrix that I can use to cluster cells with similar actions together.
I tried pairwise_distances(df['enc'], metric='jaccard')
but got a "setting an array element" error. Padding the lists doesn't make much sense to me either.
Is there any way to generate a similarity matrix based on Jaccard? Thanks.
Step 1, here is your DataFrame
import pandas as pd
data = [['Cell 1', ['run','swim','walk'], [1,2,3]], ['Cell 2', ['swim','climb','surf','gym'], [2,4,5,6]], ['Cell 3', ['jog','run'], [7,1]]]
df = pd.DataFrame(data, columns=['id', 'action', 'label_encoder'])
print(df)
or, if the columns hold comma-separated strings:
import pandas as pd
data = [['Cell 1', 'run,swim,walk', '1,2,3'], ['Cell 2', 'swim,climb,surf,gym', '2,4,5,6'], ['Cell 3', 'jog,run', '7,1']]
df = pd.DataFrame(data, columns=['id', 'action', 'label_encoder'])
df['action'] = df['action'].str.split(',')
df['label_encoder'] = df['label_encoder'].str.split(',')
print(df)
id action label_encoder
0 Cell 1 [run, swim, walk] [1, 2, 3]
1 Cell 2 [swim, climb, surf, gym] [2, 4, 5, 6]
2 Cell 3 [jog, run] [7, 1]
Step 2, add the one-hot encoded list as a new column
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform(df['label_encoder'])
# add one_hot list as a new column
df['label_encoder_one_hot'] = list(one_hot)
print(df)
id action label_encoder label_encoder_one_hot
0 Cell 1 [run, swim, walk] [1, 2, 3] [1, 1, 1, 0, 0, 0, 0]
1 Cell 2 [swim, climb, surf, gym] [2, 4, 5, 6] [0, 1, 0, 1, 1, 1, 0]
2 Cell 3 [jog, run] [7, 1] [1, 0, 0, 0, 0, 0, 1]
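This encoding is what removes the need for padding: every row becomes a 0/1 vector of the same length, one position per distinct label. As a rough illustration only (not how MultiLabelBinarizer is implemented internally), the same vectors can be reproduced in plain Python. The names rows, classes and multi_hot are made up for this sketch.

```python
# Label lists copied from the example above; note the different lengths.
rows = [[1, 2, 3], [2, 4, 5, 6], [7, 1]]

# Every distinct label becomes one column, in sorted order.
classes = sorted({label for row in rows for label in row})

# 1 if the row contains that column's label, otherwise 0.
multi_hot = [[1 if c in row else 0 for c in classes] for row in rows]

for vec in multi_hot:
    print(vec)
```

Each printed vector has seven entries, so the rows can be stacked into a regular 2-D array even though the original lists were ragged.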
Step 3, generate the similarity matrix based on Jaccard
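from sklearn.metrics import jaccard_score
import numpy as np
similarity_matrix = np.zeros((len(df), len(df)))
# Compare each pair of rows once and mirror the value, since Jaccard is symmetric.
for i in range(len(df)):
    for j in range(i+1, len(df)):
        similarity = jaccard_score(df['label_encoder_one_hot'][i], df['label_encoder_one_hot'][j])
        similarity_matrix[i,j] = similarity
        similarity_matrix[j,i] = similarity
# The diagonal stays 0 here; call np.fill_diagonal(similarity_matrix, 1.0)
# if each cell should count as fully similar to itself.
print(similarity_matrix)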
[[0. 0.16666667 0.25 ]
[0.16666667 0. 0. ]
[0.25 0. 0. ]]
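With roughly 30k rows, the double loop makes several hundred million jaccard_score calls, which will probably be very slow. One possible alternative, sketched here with NumPy only, gets all pairwise intersections from a single matrix product and derives the unions from the row sums. The function name jaccard_similarity_matrix is made up for this sketch, and X is assumed to be the stacked one-hot rows from Step 2. Unlike the loop, this version puts 1 on the diagonal.

```python
import numpy as np

# One-hot rows copied from the Step 2 output (one row per cell).
X = np.array([
    [1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
], dtype=np.int32)

def jaccard_similarity_matrix(X):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    intersection = X @ X.T                  # labels shared by each pair of rows
    sizes = X.sum(axis=1)                   # number of labels in each row
    union = sizes[:, None] + sizes[None, :] - intersection
    # A pair of rows with no labels at all has union == 0; report 0 there.
    return np.divide(
        intersection,
        union,
        out=np.zeros(intersection.shape, dtype=float),
        where=union > 0,
    )

sim = jaccard_similarity_matrix(X)
print(sim)
```

Keep in mind that a dense 30k x 30k float64 matrix takes roughly 7 GB, so a smaller dtype or block-wise processing may be needed. The original pairwise_distances call most likely failed because df['enc'] is a column of variable-length lists rather than a 2-D array. Called on the one-hot matrix instead, pairwise_distances(X.astype(bool), metric='jaccard') should return the Jaccard distances, that is, 1 minus these similarities.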