Search code examples
rhierarchical-clusteringtraminer

Find number of clusters using distance matrix with hierarchical clustering


How do I determine the optimal number of clusters while using hierarchical clustering. If I am just having the distance matrix as I am measuring only pairwise distances (levenshtein distances), how do I find out the optimal number of clusters? I referred to other posts they all use k-means, hierarchical but not for string type of data as shown below. Any suggestions on how to use R to find the number of clusters?

 set.seed(1)
 rstr <- function(n,k){   # vector of n random char(k) strings
 sapply(1:n,function(i) {do.call(paste0,as.list(sample(letters,k,replace=T)))})
 }

str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
 d  <- adist(str)
 rownames(d) <- str
hc <- hclust(as.dist(d))

Solution

  • Several statistics can be used.

    Look for example at the WeightedCluster package that can compute and plot a series of such statistics.

    To illustrate, you get the optimal number of groups for each available statistics as follows:

    library("WeightedCluster")
    hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6) 
    summary(hcRange)
    ##      1. N groups   1.  stat
    ## PBC            3  0.8799136
    ## HG             3  1.0000000
    ## HGSD           3  0.9987651
    ## ASW            3  0.4136550
    ## ASWw           3  0.4722895
    ## CH             3  8.3605263
    ## R2             6  0.4734561
    ## CHsq           3 20.6538462
    ## R2sq           6  0.6735039
    ## HC             3  0.0000000
    

    You can also plot the statistics (here we show the Average silhouette width, ASWw, Huber's Gamma, HG, and the Point biserial correlation) for all the computed solutions

    plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
    

    enter image description here

    The better solution seems to be the three groups solution.