I ran the HAC on a dataset containing 10 samples as follows:
X_train
>>array([[ 0.97699105, 0.22532681],
[-0.73247801, 0.60953553],
[-0.99434933, 0.03124842],
[-0.82325963, 0.57988328],
[ 0.50084964, -0.26616097],
[ 1.94969804, 0.42602413],
[ 1.0254459 , -0.54057545],
[-0.57115945, 0.8495053 ],
[ 1.39201222, -0.34835877],
[ 0.02372729, 0.52339387]])
Here is the result I get by applying HAC using the scipy library:
linkage(X_train, method='single')
>>array([[ 1. , 3. , 0.09550162, 2. ],
[ 7. , 10. , 0.2891525 , 3. ],
[ 6. , 8. , 0.41390592, 2. ],
[ 2. , 11. , 0.57469287, 4. ],
[ 4. , 12. , 0.59203425, 3. ],
[ 9. , 13. , 0.67840909, 5. ],
[ 0. , 14. , 0.6843032 , 4. ],
[15. , 16. , 0.92251969, 9. ],
[ 5. , 17. , 0.95429679, 10. ]])
Here is the resulting dendrogram
dendrogram(linkage(X_train, method='single'), labels=np.arange(X_train.shape[0]))
In the output matrix of the linkage(X_train, method='single')
, the first two columns represent the children in our hierarchy.
I would like to know how we do to calculate these children?
For example :
the first fusion of our algorithm involves singleton clusters containing points {1}
and {3}
. And as children we have [1, 3]
The second merge involves the previously calculated cluster containing the points {1, 3}
and the singleton cluster {7}
. And like children we have [7, 10]
. How was the value 10
obtained?
According to the docs, at the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i, where n is the number of input samples and Z is the linkage matrix. https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html. Thus 10 is just is just 10+0 where 10 is total number of points and 0 is the row where the cluster is combined.
In other words, all cluster indices i>=n actually refer to the cluster formed in Z[i - n].
If that's still unclear you can read the detailed description here https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Perform-the-Hierarchical-Clustering