My goal is to cluster data with DBSCAN from scikit-learn using a precomputed similarity matrix. I have a list of feature dictionaries. I generate the unique pairs from the list and have a function that calculates the similarity for each pair. Now I want to transform the result into a symmetric matrix that can be used as input for the clustering algorithm. I think groupby may be helpful, but I am not sure how to go about it. Here is sample code that produces a list of pairs with a similarity measure. The id field in the original list is the unique row identifier.
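    import itertools
    import random

    def add_similarity(listdict):
        random.seed(10)
        newlistdist = []
        for tup_dict in listdict:
            newdict = {}
            tup0 = tup_dict[0]
            tup1 = tup_dict[1]
            # Suffix the keys so both items of the pair fit in one dict.
            for key, value in tup0.items():
                newdict[key + "_1"] = value
            for key, value in tup1.items():
                newdict[key + "_2"] = value
            # Placeholder similarity score for the pair.
            newdict["similarity"] = random.random()
            newlistdist.append(newdict)
        return newlistdist

    def generatesymm():
        listdict = [{'feature1': 4, 'feature2': 2, "id": 100},
                    {'feature1': 3, 'feature2': 2, "id": 200},
                    {'feature1': 4, 'feature2': 2, "id": 300}]
        # All unique unordered pairs, in row-major (upper-triangle) order.
        pairs = list(itertools.combinations(listdict, 2))
        newlistdict = add_similarity(pairs)
        return newlistdict

    newlistdict = generatesymm()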
Running this code gives:

    [{'id_2': 200, 'feature1_2': 3, 'feature2_2': 2, 'feature2_1': 2, 'feature1_1': 4, 'similarity': 0.571, 'id_1': 100},
     {'id_2': 300, 'feature1_2': 4, 'feature2_2': 2, 'feature2_1': 2, 'feature1_1': 4, 'similarity': 0.428, 'id_1': 100},
     {'id_2': 300, 'feature1_2': 4, 'feature2_2': 2, 'feature2_1': 2, 'feature1_1': 3, 'similarity': 0.578, 'id_1': 200}]

The output I need:

         100    200    300
    100  1      0.571  0.428
    200  0.571  1      0.578
    300  0.428  0.578  1
It is not clear to me where id_3 comes from, but below is one way to build your DataFrame. The trick is to use numpy to index into the upper and lower triangular portions of the matrix.
In [679]:
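    import numpy as np
    import pandas as pd

    similarities = [x["similarity"] for x in newlistdict]
    # n is the number of items, not the number of pairs (they only coincide for n = 3).
    n = 3
    names = ['id_' + str(x) for x in range(1, n + 1)]
    # Upper-triangle indices in row-major order, the same order as itertools.combinations.
    iuu = np.mask_indices(n, np.triu, 1)
    mat = np.eye(n)
    mat[iuu] = similarities
    # Mirror into the lower triangle by swapping the row and column indices.
    mat[iuu[1], iuu[0]] = similarities
    df = pd.DataFrame(mat, index=names, columns=names)
    df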
Out[679]:
          id_1      id_2      id_3
    id_1  1.000000  0.896082  0.897818
    id_2  0.896082  1.000000  0.186298
    id_3  0.897818  0.186298  1.000000
(The values differ from your question because I don't know the random seed you used.)
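
If you want the rows and columns labeled with the actual ids (100, 200, 300) instead of positional names, one option is to read the `id_1` and `id_2` fields of each pair and fill both triangles directly. The following is only a sketch: `pairs_to_matrix` and the `sample` list are hypothetical names and data made up for illustration, and `1 - similarity` is an assumed way to turn similarities in [0, 1] into distances.

```python
import numpy as np
import pandas as pd


def pairs_to_matrix(pair_dicts, id1="id_1", id2="id_2", value="similarity"):
    """Build a symmetric, id-labeled similarity DataFrame from pair records.

    Hypothetical helper; the key names default to those in the question.
    """
    # Collect the ids in first-seen order so the labels follow the input.
    ids = []
    for rec in pair_dicts:
        for key in (id1, id2):
            if rec[key] not in ids:
                ids.append(rec[key])
    pos = {item_id: i for i, item_id in enumerate(ids)}

    mat = np.eye(len(ids))  # similarity of 1 on the diagonal
    for rec in pair_dicts:
        i, j = pos[rec[id1]], pos[rec[id2]]
        mat[i, j] = mat[j, i] = rec[value]  # fill both triangles at once
    return pd.DataFrame(mat, index=ids, columns=ids)


# Hypothetical sample records shaped like the question's pair list.
sample = [
    {"id_1": 100, "id_2": 200, "similarity": 0.571},
    {"id_1": 100, "id_2": 300, "similarity": 0.428},
    {"id_1": 200, "id_2": 300, "similarity": 0.578},
]

sim = pairs_to_matrix(sample)

# Assumed conversion: treat 1 - similarity as a distance (0 on the diagonal).
dist = 1.0 - sim
```

Because each record carries its own ids, this does not depend on the order of the pairs. Note that scikit-learn's DBSCAN with `metric="precomputed"` expects a distance matrix rather than a similarity matrix, so something like `dist.values` is what you would pass to `fit`, not `sim`.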