I want to run hierarchical clustering with single linkage to cluster documents with 300 features and 1500 observations. I want to find the optimal number of clusters for this problem.
The linked page uses the following code to find the number of clusters with the maximum gap:
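# Load the packages that provide clusGap() (cluster) and hcut(), fviz_gap_stat() (factoextra)
library(cluster)
library(factoextra)
# Compute gap statistic
set.seed(123)
iris.scaled <- scale(iris[, -5])
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)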
However, the link does not clearly define hcut. How can I tell clusGap() to use single-linkage hierarchical clustering?
Is there an equivalent of clusGap() in Python?
Thanks
The hcut() function is part of the factoextra package used in the link you posted:
hcut package:factoextra R Documentation
Computes Hierarchical Clustering and Cut the Tree
Description:
Computes hierarchical clustering (hclust, agnes, diana) and cut the tree into k clusters. It also accepts correlation based distance measure methods such as "pearson", "spearman" and "kendall".
R also has a built-in function, hclust(), which can be used to perform hierarchical clustering. By default, however, it uses complete linkage rather than single linkage, and it takes a dissimilarity object and returns a tree rather than cluster assignments, so you can't simply replace hcut with hclust.
If you look at the help for clusGap(), however, you will see that you can provide a custom clustering function to be applied:
FUNcluster: a ‘function’ which accepts as first argument a (data) matrix like ‘x’, second argument, say k, k >= 2, the number of clusters desired, and returns a ‘list’ with a component named (or shortened to) ‘cluster’ which is a vector of length ‘n = nrow(x)’ of integers in ‘1:k’ determining the clustering or grouping of the ‘n’ observations.
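The hclust() function can perform single-linkage hierarchical clustering when called with method="single", so you can wrap it together with dist() and cutree() in a function of the required form: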
cluster_fun <- function(x, k) list(cluster=cutree(hclust(dist(x), method="single"), k=k))
gap_stat <- clusGap(iris.scaled, FUN=cluster_fun, K.max=10, B=50)
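On the Python part of the question: as far as I know, SciPy does not ship a gap-statistic routine, but the same idea can be sketched with scipy.cluster.hierarchy. The following is a minimal sketch, not a port of clusGap(). The helper names log_w_curve and gap_statistic are my own, and the reference data sets are drawn uniformly over the bounding box of each feature, which is an assumption that differs from the PCA-based reference space clusGap() uses by default.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def log_w_curve(x, k_max):
    """Return log(W_k) for k = 1..k_max, cutting one single-linkage tree."""
    tree = linkage(x, method="single")  # Euclidean distance by default
    log_w = np.empty(k_max)
    for k in range(1, k_max + 1):
        # Cut the tree into at most k clusters (labels start at 1).
        labels = fcluster(tree, t=k, criterion="maxclust")
        w = 0.0
        for label in np.unique(labels):
            members = x[labels == label]
            if len(members) > 1:
                # Sum of within-cluster pairwise distances over cluster size.
                w += pdist(members).sum() / len(members)
        log_w[k - 1] = np.log(w)
    return log_w


def gap_statistic(x, k_max=10, n_refs=50, seed=123):
    """Hypothetical helper: gap and its standard error for k = 1..k_max."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: uniform reference data over the bounding box of x.
    lo, hi = x.min(axis=0), x.max(axis=0)
    ref_log_w = np.array([
        log_w_curve(rng.uniform(lo, hi, size=x.shape), k_max)
        for _ in range(n_refs)
    ])
    gap = ref_log_w.mean(axis=0) - log_w_curve(x, k_max)
    se = ref_log_w.std(axis=0) * np.sqrt(1.0 + 1.0 / n_refs)
    return gap, se


# Demo on synthetic data: two well-separated blobs in 2-D.
demo_rng = np.random.default_rng(0)
blobs = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(30, 2)),
    demo_rng.normal(10.0, 0.5, size=(30, 2)),
])
gap, se = gap_statistic(blobs, k_max=5, n_refs=20)
best_k = int(np.argmax(gap)) + 1  # k with the largest gap
print(best_k)
```

For your data you would pass the 1500 x 300 matrix as x. Picking the k with the largest gap is only one possible rule: the printed clusGap() output and fviz_gap_stat() default to a "first SE max" rule, so the two approaches can suggest different values of k.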