python r cluster-analysis hierarchical-clustering unsupervised-learning

How to find optimal number of clusters in hierarchical clustering using Gap statistic?

I want to run hierarchical clustering with single linkage to cluster documents with 300 features and 1500 observations. I want to find the optimal number of clusters for this problem.

The page linked below uses the following code to find the number of clusters with the maximum gap:

http://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning

library(cluster)     # provides clusGap()
library(factoextra)  # provides hcut() and fviz_gap_stat()

# Compute gap statistic
set.seed(123)

iris.scaled <- scale(iris[, -5])

gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)

# Plot gap statistic
fviz_gap_stat(gap_stat)

However, the linked page does not clearly define hcut. How can I tell clusGap() to use single-linkage hierarchical clustering?

Is there an equivalent of clusGap() in Python?

Thanks

Solution

The hcut() function is part of the factoextra package used in the link you posted. Its help page reads:

hcut package:factoextra R Documentation

Computes Hierarchical Clustering and Cut the Tree

Description:
 Computes hierarchical clustering (hclust, agnes, diana) and cut
 the tree into k clusters. It also accepts correlation based
 distance measure methods such as "pearson", "spearman" and
 "kendall".

R also has a built-in function, hclust(), which can be used to perform hierarchical clustering. By default, however, it uses complete linkage rather than single linkage, and it takes a distance object and returns a tree rather than cluster labels, so you can't simply replace hcut with hclust.

If you look at the help for clusGap(), however, you will see that you can provide a custom clustering function to be applied:

FUNcluster: a ‘function’ which accepts as first argument a (data) matrix like ‘x’, second argument, say k, k >= 2, the number of clusters desired, and returns a ‘list’ with a component named (or shortened to) ‘cluster’ which is a vector of length ‘n = nrow(x)’ of integers in ‘1:k’ determining the clustering or grouping of the ‘n’ observations.

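hclust() performs single-linkage clustering when called with method = "single", and cutree() cuts the resulting tree into k clusters, so you can wrap both in a function that matches this interface: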

# Single-linkage tree cut into k clusters, returned in the list form clusGap() expects
cluster_fun <- function(x, k) list(cluster = cutree(hclust(dist(x), method = "single"), k = k))
gap_stat <- clusGap(iris.scaled, FUN = cluster_fun, K.max = 10, B = 50)
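As for Python: I am not aware of a built-in equivalent of clusGap() in SciPy or scikit-learn, so the statistic usually has to be computed by hand. Below is a minimal sketch, assuming NumPy and SciPy are installed. The function names (single_linkage_labels, log_within_dispersion, log_w_curve, gap_statistic) are made up for illustration. It draws reference data uniformly from the bounding box of the original features, which is not clusGap()'s default reference distribution, and it uses a slightly different standard-error formula, so the numbers will not match R exactly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def single_linkage_labels(x, k):
    """Cut a single-linkage tree built on Euclidean distances into k clusters."""
    tree = linkage(x, method="single", metric="euclidean")
    return fcluster(tree, t=k, criterion="maxclust")  # labels start at 1


def log_within_dispersion(x, labels):
    """log(W_k): per cluster, the sum of pairwise distances divided by its size."""
    w = 0.0
    for label in np.unique(labels):
        members = x[labels == label]
        w += pdist(members).sum() / len(members)  # a singleton contributes 0
    return np.log(w)


def log_w_curve(x, k_max):
    """log(W_k) for k = 1..k_max, reusing one single-linkage tree for all cuts."""
    tree = linkage(x, method="single", metric="euclidean")
    return np.array([
        log_within_dispersion(x, fcluster(tree, t=k, criterion="maxclust"))
        for k in range(1, k_max + 1)
    ])


def gap_statistic(x, k_max=10, n_refs=50, seed=0):
    """Return (ks, gap, se) for single-linkage clustering of the rows of x."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = x.min(axis=0), x.max(axis=0)
    observed = log_w_curve(x, k_max)
    # Assumption: reference sets are uniform over the feature-wise bounding box.
    reference = np.array([
        log_w_curve(rng.uniform(lo, hi, size=x.shape), k_max)
        for _ in range(n_refs)
    ])
    gap = reference.mean(axis=0) - observed
    se = reference.std(axis=0) * np.sqrt(1.0 + 1.0 / n_refs)
    return np.arange(1, k_max + 1), gap, se
```

With the results in hand, ks[np.argmax(gap)] gives the number of clusters with the largest gap, and single_linkage_labels(x, k) returns the corresponding cluster assignment. The se values are returned so that a standard-error based rule can be applied instead of the plain maximum.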