python scikit-learn hierarchical-clustering

Hierarchical clustering termination

As I understand it, agglomerative hierarchical clustering starts by merging the points that are closest to each other. For comparison, I am trying to get the clustering results at stages where only a certain percentage of the data has been clustered, e.g. 40%, 50%, 60%...

So I need a way to terminate the hierarchical clustering (Ward's) algorithm in sklearn after it has clustered a specified percentage of the data points. For example, stop clustering after 60% of the dataset has been clustered.

What would be the best way to do this?

Solution

According to the scikit-learn documentation:

The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.

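Hence, you can do "early stopping" by fixing the number of clusters with n_clusters and setting the compute_full_tree parameter appropriately (as described in the API reference). With compute_full_tree=False, construction of the tree stops once n_clusters clusters remain; the documentation notes that this option is only useful when a connectivity matrix is specified. Starting from the number of clusters obtained from a run with the full tree computed, you could then define the stopping points as ratios of that number.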
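As a minimal sketch of stopping at a fixed number of clusters (the toy data, the helper name cluster_with_n_clusters and the value 80 are illustrative assumptions, not part of the question):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))


def cluster_with_n_clusters(X, n_clusters):
    """Run Ward agglomerative clustering and stop at n_clusters clusters."""
    model = AgglomerativeClustering(
        n_clusters=n_clusters,
        linkage="ward",
        # The full tree is built here and then cut at n_clusters.
        # compute_full_tree=False would stop the construction early instead,
        # which only pays off when a connectivity matrix is passed.
        compute_full_tree=True,
    )
    return model.fit_predict(X)


# 200 points merged down to 80 clusters, i.e. 120 merges were performed.
labels = cluster_with_n_clusters(X, n_clusters=80)
print(np.unique(labels).size)
```

Points that still sit alone in their cluster at this stage have not been merged with anything yet.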

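What remains is to find the relation between the number of clusters and the proportion of the data that has been clustered. Each merge reduces the number of clusters by one, so n_clusters fixes how many merges are performed, but not directly how many points those merges involve. Still, this is probably the most straightforward way to do what you want without modifying the agglomerative clustering algorithm itself.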
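One possible way to establish that relation empirically is sketched below. It assumes that a point counts as "clustered" once it belongs to a cluster of at least two points; that definition, the helper names clustered_fraction and labels_for_fraction, and the toy data are assumptions made for illustration. The scan refits the model for every candidate n_clusters, which is simple but slow on large datasets.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def clustered_fraction(labels):
    """Fraction of points whose cluster contains at least two points."""
    counts = np.bincount(labels)
    return counts[counts > 1].sum() / labels.size


def labels_for_fraction(X, target):
    """Return (n_clusters, labels) for the largest n_clusters at which
    the clustered fraction reaches `target`."""
    n_samples = X.shape[0]
    # Fewer clusters means more merges, so scan from many clusters to few.
    for n_clusters in range(n_samples - 1, 0, -1):
        labels = AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward", compute_full_tree=True
        ).fit_predict(X)
        if clustered_fraction(labels) >= target:
            return n_clusters, labels
    raise ValueError("target must be at most 1")


# Toy data, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

for target in (0.4, 0.5, 0.6):
    n_clusters, labels = labels_for_fraction(X, target)
    print(target, n_clusters, clustered_fraction(labels))
```

Since a single merge adds at most two newly clustered points, the fraction reached this way overshoots the target by less than 2 / n_samples.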