scipy, hierarchical-clustering, linkage

scipy linkage with a given distance matrix


I have a very large sparse matrix (a few million rows, 500 columns). I have already computed a 5000 x 5000 distance matrix. I need to use scipy.cluster.hierarchy.linkage to get the clustering according to this matrix. I know that linkage accepts a custom metric function, but computing this distance matrix again is very time-consuming.
How can I tell scipy to use the distances from this matrix? I tried

dist = my_dist(X)  # 2-D NumPy array
linkage(X, metric=lambda x, y: dist[x, y])

but the x and y passed to the metric are the row values, not the row indices.


Solution

  • You can pass the distance matrix to linkage if you represent it as a "condensed" distance matrix. You can use scipy.spatial.distance.squareform to convert dist to the condensed representation.

    Something like this:

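    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    dist = my_dist(X)                     # square, symmetric distance matrix
    condensed_dist = squareform(dist)     # 1-D condensed form
    linkresult = linkage(condensed_dist)  # the default method is 'single'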
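
For anyone who wants to try this end to end, here is a minimal self-contained sketch. The random data, the Euclidean metric, the "average" method and the cut into two clusters are all assumptions made for illustration; `my_dist` from the question is not reproduced, and the square matrix is built with `pdist` purely as a stand-in. Note that a square 2-D array passed directly to `linkage` is treated as a set of observation vectors rather than as distances, which is why the conversion to condensed form matters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Illustrative data only: 6 observations with 3 features each.
rng = np.random.default_rng(0)
X = rng.random((6, 3))

# Stand-in for the precomputed square distance matrix from the question.
dist = squareform(pdist(X, metric="euclidean"))  # shape (6, 6)

# Square -> condensed. checks=True validates symmetry and a zero diagonal.
condensed = squareform(dist, checks=True)        # length 6 * 5 / 2 = 15

# Cluster directly from the precomputed distances.
Z = linkage(condensed, method="average")

# Optional: cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The condensed vector stores only the upper triangle, so for an n x n matrix it has n * (n - 1) / 2 entries.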