Search code examples
hierarchical-clustering

HAC: Children in a dendrogram


I ran the HAC on a dataset containing 10 samples as follows:

X_train
>>array([[ 0.97699105,  0.22532681],
         [-0.73247801,  0.60953553],
         [-0.99434933,  0.03124842],
         [-0.82325963,  0.57988328],
         [ 0.50084964, -0.26616097],
         [ 1.94969804,  0.42602413],
         [ 1.0254459 , -0.54057545],
         [-0.57115945,  0.8495053 ],
         [ 1.39201222, -0.34835877],
         [ 0.02372729,  0.52339387]])

Here is the result I get by applying HAC using the scipy library:

linkage(X_train, method='single')
>>array([[ 1.        ,  3.        ,  0.09550162,  2.        ],
         [ 7.        , 10.        ,  0.2891525 ,  3.        ],
         [ 6.        ,  8.        ,  0.41390592,  2.        ],
         [ 2.        , 11.        ,  0.57469287,  4.        ],
         [ 4.        , 12.        ,  0.59203425,  3.        ],
         [ 9.        , 13.        ,  0.67840909,  5.        ],
         [ 0.        , 14.        ,  0.6843032 ,  4.        ],
         [15.        , 16.        ,  0.92251969,  9.        ],
         [ 5.        , 17.        ,  0.95429679, 10.        ]])

Here is the resulting dendrogram

dendrogram(linkage(X_train, method='single'), labels=np.arange(X_train.shape[0]))

Dendrogram

In the output matrix of the linkage(X_train, method='single'), the first two columns represent the children in our hierarchy.

I would like to know how we do to calculate these children?

For example : the first fusion of our algorithm involves singleton clusters containing points {1} and {3}. And as children we have [1, 3] The second merge involves the previously calculated cluster containing the points {1, 3} and the singleton cluster {7}. And like children we have [7, 10]. How was the value 10 obtained?


Solution

  • According to the docs, at the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i, where n is the number of input samples and Z is the linkage matrix. https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html. Thus 10 is just is just 10+0 where 10 is total number of points and 0 is the row where the cluster is combined.

    In other words, all cluster indices i>=n actually refer to the cluster formed in Z[i - n].

    If that's still unclear you can read the detailed description here https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Perform-the-Hierarchical-Clustering