python pandas scipy scikit-learn dbscan

Create a symmetric matrix from a pairwise list in Python for clustering with scikit-learn DBSCAN


My goal is to cluster with scikit-learn's DBSCAN using a precomputed similarity matrix. I have a list of feature records. I generate the unique pairs from the list and have a function that calculates the similarity between the two members of each pair. Now I want to turn the result into a symmetric matrix that can be used as input to the clustering algorithm. I think groupby may be helpful, but I am not sure how to go about it. Here is sample code that produces a list of pairs with a similarity measure. The id field in the original list is the unique row identifier.

import itertools
import random


def add_similarity(listdict):
    random.seed(10)
    newlistdist = []
    for tup0, tup1 in listdict:
        newdict = {}
        # suffix the keys so both members of the pair fit in one dict
        for key, value in tup0.items():
            newdict[key + "_1"] = value
        for key, value in tup1.items():
            newdict[key + "_2"] = value
        # random stand-in for the similarity measure
        newdict["similarity"] = random.random()
        newlistdist.append(newdict)
    return newlistdist


def generatesymm():
    listdict = [{'feature1': 4, 'feature2': 2, "id": 100},
                {'feature1': 3, 'feature2': 2, "id": 200},
                {'feature1': 4, 'feature2': 2, "id": 300}]
    pairs = list(itertools.combinations(listdict, 2))
    newlistdict = add_similarity(pairs)
    return newlistdict

Running this code gives:

    [{'id_2': 200, 'feature1_2': 3, 'feature2_2': 2, 'feature2_1': 2, 'feature1_1': 4, 'similarity': 0.571, 'id_1': 100},
     {'id_2': 300, 'feature1_2': 4, 'feature2_2': 2, 'feature2_1': 2, 'feature1_1': 4, 'similarity': 0.428, 'id_1': 100},
     {'id_2': 300, 'feature1_2': 4, 'feature2_2': 2, 'feature2_1': 2, 'feature1_1': 3, 'similarity': 0.578, 'id_1': 200}]

The output I need:

           100      200      300
    100    1        0.571    0.428
    200    0.571    1        0.578
    300    0.428    0.578    1

Solution

  • It is not clear to me where id_3 comes from, but below is one way to make your dataframe. The trick is to use numpy to index into the upper and lower triangular portions of the matrix.

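    In [679]:
    import numpy as np
    import pandas as pd
    # newlistdict is the list of pair dicts from the question
    similarities = [x["similarity"] for x in newlistdict]
    # n is the number of items, not the number of pairs (n*(n-1)/2)
    n = len({x[k] for x in newlistdict for k in ("id_1", "id_2")})
    names = ['id_'+str(x) for x in range(1, n+1)]
    # upper-triangle indices come in the same row-major order
    # as itertools.combinations
    iuu = np.mask_indices(n, np.triu, 1)
    mat = np.eye(n)
    mat[iuu] = similarities
    # fill the lower triangle through the transposed view; assigning to
    # tril indices directly pairs values up wrongly once n > 3
    mat.T[iuu] = similarities
    df = pd.DataFrame(mat, index=names, columns=names)
    df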
    
    Out[679]:
            id_1        id_2        id_3
    id_1    1.000000    0.896082    0.897818
    id_2    0.896082    1.000000    0.186298
    id_3    0.897818    0.186298    1.000000
    

    (The values differ from your question because I don't know the random seed you used.)
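
    If you want the rows and columns labelled by the actual ids (100, 200, 300), as in the table you asked for, one option is to look up each pair's position from its id_1 and id_2 fields instead of relying on the order of the list. The following is only a minimal sketch in plain Python: the symmetric_matrix helper and its diagonal parameter are names I made up for illustration, the pair values are the rounded ones from the question, and the 1 - similarity conversion at the end assumes similarities lie in [0, 1].

```python
# Pair records shaped like the question's output (rounded values).
pairs = [
    {"id_1": 100, "id_2": 200, "similarity": 0.571},
    {"id_1": 100, "id_2": 300, "similarity": 0.428},
    {"id_1": 200, "id_2": 300, "similarity": 0.578},
]


def symmetric_matrix(pair_records, diagonal=1.0):
    """Hypothetical helper: return (ids, matrix) with matrix[i][j] == matrix[j][i]."""
    # Collect the ids in sorted order so rows and columns line up.
    ids = sorted({rec[key] for rec in pair_records for key in ("id_1", "id_2")})
    pos = {item_id: i for i, item_id in enumerate(ids)}
    n = len(ids)
    # Start with the diagonal filled in and zeros for pairs that are missing.
    matrix = [[diagonal if i == j else 0.0 for j in range(n)] for i in range(n)]
    for rec in pair_records:
        i, j = pos[rec["id_1"]], pos[rec["id_2"]]
        # Write each value to both (i, j) and (j, i).
        matrix[i][j] = matrix[j][i] = rec["similarity"]
    return ids, matrix


ids, sim = symmetric_matrix(pairs)

# Assumed conversion: distance = 1 - similarity (only sensible for values in [0, 1]).
dist = [[1.0 - s for s in row] for row in sim]
```

    pd.DataFrame(sim, index=ids, columns=ids) then gives the labelled table. Note that DBSCAN with metric="precomputed" interprets its input as distances rather than similarities, so it is probably dist (converted to an array) that you want to pass to fit, not sim.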