Tags: python, r, cluster-analysis, hierarchical-clustering, unsupervised-learning

How to find optimal number of clusters in hierarchical clustering using Gap statistic?


I want to run hierarchical clustering with single linkage to cluster documents with 300 features and 1500 observations. I want to find the optimal number of clusters for this problem.

The page linked below uses the following code to find the number of clusters with the maximum gap:

http://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning

library(cluster)     # clusGap()
library(factoextra)  # hcut(), fviz_gap_stat()

# Compute gap statistic
set.seed(123)
iris.scaled <- scale(iris[, -5])
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)

# Plot gap statistic
fviz_gap_stat(gap_stat)

But the linked page does not clearly define hcut. How can I tell clusGap() to use single-linkage hierarchical clustering?

Is there an equivalent of clusGap() in Python?

Thanks


Solution

  • The hcut() function is part of the factoextra package used in the link you posted:

    hcut package:factoextra R Documentation

    Computes Hierarchical Clustering and Cut the Tree

    Description:

     Computes hierarchical clustering (hclust, agnes, diana) and cut
     the tree into k clusters. It also accepts correlation based
     distance measure methods such as "pearson", "spearman" and
     "kendall".
    

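    R also has a built-in function, hclust(), which performs hierarchical clustering. You can't simply replace hcut with hclust, though: its default method is complete linkage rather than single linkage, it expects a dissimilarity object (as produced by dist()) rather than a data matrix, and it returns a tree rather than cluster labels.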

    If you look at the help for clusGap(), however, you will see that you can provide a custom clustering function to be applied:

    FUNcluster: a ‘function’ which accepts as first argument a (data) matrix like ‘x’, second argument, say k, k >= 2, the number of clusters desired, and returns a ‘list’ with a component named (or shortened to) ‘cluster’ which is a vector of length ‘n = nrow(x)’ of integers in ‘1:k’ determining the clustering or grouping of the ‘n’ observations.

    hclust() performs single-linkage clustering when called with method="single", so you can wrap it together with dist() and cutree() in a function of that form:

    # Single linkage on Euclidean distances, with the tree cut into k clusters
    cluster_fun <- function(x, k) list(cluster=cutree(hclust(dist(x), method="single"), k=k))
    gap_stat <- clusGap(iris.scaled, FUN=cluster_fun, K.max=10, B=50)
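
On the Python part of the question: as far as I know, neither SciPy nor scikit-learn ships a direct counterpart to clusGap(), so here is a minimal sketch of the gap statistic around SciPy's single-linkage clustering. It is not a port of clusGap(). The function names (log_dispersions, gap_statistic, choose_k) are made up for illustration, and the sketch assumes squared Euclidean dispersion, a uniform reference distribution over the bounding box of the data, and the "smallest k with Gap(k) >= Gap(k+1) - s(k+1)" rule from the original gap statistic paper. clusGap()'s defaults differ in these details, so the numbers will not match R exactly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def log_dispersions(x, k_max):
    """Return log(W_k) for k = 1..k_max, cutting one single-linkage tree at each k."""
    tree = linkage(x, method="single", metric="euclidean")
    out = []
    for k in range(1, k_max + 1):
        labels = fcluster(tree, t=k, criterion="maxclust")
        # W_k: pooled sum of squared distances to each cluster's mean (assumed form).
        w_k = sum(
            ((x[labels == c] - x[labels == c].mean(axis=0)) ** 2).sum()
            for c in np.unique(labels)
        )
        out.append(np.log(w_k))
    return np.array(out)


def gap_statistic(x, k_max=10, n_refs=50, seed=123):
    """Return (gap, s) arrays indexed by k - 1 for k = 1..k_max."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = x.min(axis=0), x.max(axis=0)
    # Reference data: uniform over the bounding box of x (an assumption of this sketch).
    ref = np.array([
        log_dispersions(rng.uniform(lo, hi, size=x.shape), k_max)
        for _ in range(n_refs)
    ])
    gap = ref.mean(axis=0) - log_dispersions(x, k_max)
    s = ref.std(axis=0) * np.sqrt(1.0 + 1.0 / n_refs)
    return gap, s


def choose_k(gap, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); falls back to k_max."""
    for k in range(1, len(gap)):
        if gap[k - 1] >= gap[k] - s[k]:
            return k
    return len(gap)


# Toy data: two well-separated blobs, 40 points each.
rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal(0.0, 0.5, size=(40, 2)),
    rng.normal(10.0, 0.5, size=(40, 2)),
])
gap, s = gap_statistic(x, k_max=5, n_refs=20)
print("gap:", np.round(gap, 3))
print("chosen k:", choose_k(gap, s))
```

For your data you would pass the 1500 x 300 document matrix in place of x. Since single linkage tends to peel off individual outlying points rather than split the data into balanced groups, it is worth checking the cluster sizes at the chosen k before trusting the result.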