Tags: r, cluster-analysis, hierarchical-clustering

Find the number of clusters using the clusGap function in R


Could you help me find the ideal number of clusters using the clusGap function? There is a similar example in this link: https://www.rdocumentation.org/packages/factoextra/versions/1.0.7/topics/fviz_nbclust

But I would like to do it for my case. My code is below:

    library(cluster)
    library(factoextra) # provides hcut

    df <- structure(
      list(Propertie = c(1, 2, 3, 4, 5, 6, 7, 8),
           Latitude = c(-24.779225, -24.789635, -24.763461, -24.794394, -24.747102, -24.781307, -24.761081, -24.761084),
           Longitude = c(-49.934816, -49.922324, -49.911616, -49.906262, -49.890796, -49.8875254, -49.8875254, -49.922244),
           waste = c(526, 350, 526, 469, 285, 433, 456, 825)),
      class = "data.frame", row.names = c(NA, -8L))

    df <- scale(df)

    hcluster <- clusGap(df, FUN = hcut, K.max = 100, B = 50)

Running it gives this error:

    Clustering k = 1,2,..., K.max (= 100): .. Error in sil.obj[, 1:3] : incorrect number of dimensions

Solution

  • The issue is that you set K.max to 100 but have only eight observations in your dataset. As noted in the clusGap documentation, K.max is the
    maximum number of clusters to consider, so in your case K.max cannot be greater than seven.
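
    A minimal sketch of capping K.max at one less than the number of rows, so the call cannot ask for more clusters than the data can support. To stay self-contained it uses a toy random matrix rather than the question's data, and `hclust_cut` is a hypothetical stand-in for hcut built from base R's hclust and cutree. Both are assumptions made for illustration.

    ```r
    library(cluster)

    # Hypothetical stand-in for factoextra::hcut (an assumption, not the real hcut):
    # clusGap only needs a function(x, k) returning a list with a `cluster` vector.
    hclust_cut <- function(x, k) {
      list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
    }

    # Toy 8-row matrix, NOT the question's data
    set.seed(1)
    x <- scale(matrix(rnorm(8 * 3), nrow = 8))

    # Cap the requested maximum at one less than the number of observations
    k_max <- min(100, nrow(x) - 1)

    gap <- clusGap(x, FUNcluster = hclust_cut, K.max = k_max, B = 50, verbose = FALSE)

    # The result table has one row per k = 1, ..., k_max
    nrow(gap$Tab)
    ```

    Computing the cap from nrow(x) rather than hard-coding it keeps the call valid if rows are later added or removed.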

    It is unclear to me whether clustering is appropriate for a dataset this small. Nevertheless, a working implementation is below. I have modified the plot_clusgap function from the R/Bioconductor phyloseq package to visualize the results.

    library(data.table)
    library(cluster)
    library(factoextra) # for hcut function
    library(ggplot2)    # for the plot below
    
    df <- data.table(properties = c(1,2,3,4,5,6,7,8),
                    latitude = c(-24.779225, -24.789635, -24.763461, -24.794394, -24.747102,-24.781307,-24.761081,-24.761084),
                    longitude = c(-49.934816, -49.922324, -49.911616, -49.906262, -49.890796,-49.8875254,-49.8875254,-49.922244),
                    waste = c(526, 350, 526, 469, 285, 433, 456,825))
    
    df <- scale(df)
    
    # perform clustering, B = 500 is recommended
    hcluster <- clusGap(df, FUN = hcut, K.max = 7, B = 500)
    
    # extract results
    dat <- data.table(hcluster$Tab)
    dat[, k := .I]
    
    # visualize gap statistic
    p <- ggplot(dat, aes(k, gap)) + geom_line() + geom_point(size = 3) +
      geom_errorbar(aes(ymax = gap + SE.sim, ymin = gap - SE.sim), width = 0.25) +
      ggtitle("Clustering Results") +
      labs(x = "Number of Clusters", y = "Gap Statistic") +
      theme(plot.title = element_text(size = 16, hjust = 0.5, face = "bold"),
            axis.title = element_text(size = 12, face = "bold"))

    # display the plot
    print(p)
    

    Here is the resulting figure:

    [Figure: clustering results produced by the script above.]

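    Note that all of the gap statistic values are negative, meaning the observed within-cluster dispersion is larger than that of the reference data clusGap simulates at every k. This indicates that the optimal number of clusters is k = 1 (i.e., no clustering).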
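
    As a rough sketch of how a value of k can be read off programmatically instead of from the plot: the cluster package provides maxSE, which applies selection rules such as the one proposed by Tibshirani et al. (2001) to a vector of gap values and their standard errors. The numbers below are made up for illustration only and are not the values from this run.

    ```r
    library(cluster)

    # Made-up gap values and standard errors, for illustration only
    gap_vals <- c(-0.10, -0.25, -0.30, -0.28)
    se_vals  <- c(0.05, 0.05, 0.05, 0.05)

    # Tibs2001SEmax: smallest k such that gap(k) >= gap(k + 1) - SE(k + 1)
    k_best <- maxSE(gap_vals, se_vals, method = "Tibs2001SEmax")
    k_best

    # Applied to a clusGap result named `hcluster`, the same call would be:
    # maxSE(hcluster$Tab[, "gap"], hcluster$Tab[, "SE.sim"], method = "Tibs2001SEmax")
    ```

    With these illustrative numbers the rule picks k = 1, because the first gap value (-0.10) is already at least the second value minus its standard error (-0.25 - 0.05 = -0.30).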