Search code examples
pythonscipydendrogramhcluster

Dendrogram through scipy given a similarity matrix


I have computed a jaccard similarity matrix with Python. I want to cluster highest similarities to lowest, however, no matter what linkage function I use it produces the same dendrogram! I have a feeling that the function assumes that my matrix is of original data, but I have already computed the first similarity matrix. Is there any way to pass this similarity matrix through to the dendrogram so it plots correctly? Or am I going to have to output the matrix and simply do it with R. Passing through the original raw data is not possible, as I am computing similarities of words. Thanks for the help!

Here is some code:

SimMatrix = [[ 0.,0.09259259,  0.125     ,  0.        ,  0.08571429],
   [ 0.09259259,  0.        ,  0.05555556,  0.        ,  0.05128205],
   [ 0.125     ,  0.05555556,  0.        ,  0.03571429,  0.05882353],
   [ 0.        ,  0.        ,  0.03571429,  0.        ,  0.        ],
   [ 0.08571429,  0.05128205,  0.05882353,  0.        ,  0.        ]]

linkage = hcluster.complete(SimMatrix) #doesnt matter what linkage...
dendro  = hcluster.dendrogram(linkage) #same plot for all types?
show()

If you run this code, you will see a dendrogram that is completely backwards. No matter what linkage type I use, it produces the same dendrogram. This intuitively can not be correct!


Solution

  • Here's the solution. Turns out the SimMatrix needs to be first converted into a condensed matrix (the diagonal, upper right or bottom left, of this matrix). You can see this in the code below:

    import scipy.spatial.distance as ssd
    distVec = ssd.squareform(SimMatrix)
    linkage = hcluster.linkage(1 - distVec)
    dendro  = hcluster.dendrogram(linkage)
    show()