Search code examples
rsilhouette

Silhouette plot from dendrogram in R


I'm using hierarchical clustering for my data. And I'd like to plot the Silhouette score across different number of cluster. I have searched a lot posts, and I found that I need to use pam for the clustering in order to plot the Silhouette score. I'm wodering if there is a way to plot based on hierarchical clustering result?

Here is a sample data:

structure(list(safe = c(1.5, 1.5, 2, 1, 1, 1, 2, 1.5, 1, 1, 1.5, 
1, 2, 1, 3.5, 1, 1, 1.5, 1.5, 1, 1, 1, 1, 2, 1.5, 1, 1.5, 1, 
1, 1.5, 2.5, 1, 2, 1, 1.5, 1, 1.5, 1, 1, 1, 2.5, 2, 1, 1, 1, 
2, 1, 1.5, 1.5, 1.5, 1.5, 3, 3, 1, 1.5, 2, 1, 1.5, 1, 1.5, 1, 
1.5, 1, 1, 1, 1.5, 1, 1, 1, 1.5, 1.5, 1.5, 1, 2, 1, 1, 1, 1, 
1, 1, 1.5, 1, 1, 2, 1, 1, 1, 1.5, 1, 1.5, 1.5, 1, 1, 1, 1, 1.5, 
1, 2, 1.5, 1, 1), nhood.soccapi = c(1.8, 1.6, 2.8, 2.2, 2, 3.6, 
3.2, 1.8, 1.6, 1.8, 1.4, 3.8, 1.6, 1, 2, 2.2, 1, 1, 2, 1, 1, 
2.2, 1, 1.4, 1.8, 2, 2.4, 1.8, 1, 2, 2.2, 1.6, 2.2, 1.2, 2, 2, 
1, 2, 2, 2, 2, 1.8, 1.4, 1.6, 1, 2.2, 1.4, 1.6, 2.4, 1.2, 1.4, 
2.4, 2, 2.4, 2, 1.8, 1.6, 1, 1.2, 1.8, 2, 3.4, 2, 2, 1, 2, 1, 
1.4, 2, 2.8, 2, 1, 2, 1.2, 1.4, 2, 1.6, 1.4, 2.2, 2.4, 1.8, 1.4, 
1.4, 2, 2.2, 1, 2, 2.2, 1, 1.8, 2.2, 1.6, 1.2, 1, 2, 1, 1.2, 
2.6, 1.8, 1.8, 1.8), DISORDER = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, NA, NA, 2L, 1L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 2L, 1L, 1L, NA, 2L, 
NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA, 2L, 1L, 1L, 2L, 1L, NA, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, NA, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, NA, 1L, 
1L, 2L, 2L, 2L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 2L, NA, 2L
), .Label = c("0", "1"), class = "factor"), Scho_socialdep1000_3 = structure(c(3L, 
3L, 3L, 3L, 3L, 2L, NA, 3L, 2L, 1L, 2L, 1L, NA, 2L, NA, NA, 1L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 3L, 2L, NA, NA, 1L, 2L, 2L, 2L, 
1L, 1L, 2L, 3L, NA, 1L, 1L, 2L, NA, 2L, 2L, 1L, 2L, NA, 1L, 3L, 
2L, 1L, 1L, NA, 3L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, NA, 1L, 1L, 
1L, NA, NA, 1L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 1L, 1L, 3L, 2L, 1L, 
NA, 2L, NA, 3L), .Label = c("0", "1", "2"), class = "factor")), row.names = 100:200, class = "data.frame")

And my cluster code:

dist <- daisy(dt, metric = "gower")
cls.env <- hclust(dist, method = "average")

Solution

  • hclust returns just a dendrogram representing clusters inside clusters. The silhouette score is defined on one specific clustering and not on all possible clusterings. This is why it works with Partitioning Arround Medoids (PAM) out of the box. In hierarchical clustering, however, one needs to decide first which clustering to chose by cutting dendrogram tree.

    This is how to plot the silhouettes for a hierarchical clustering using k=5 clusters:

    library(tidyverse)
    library(cluster)
    
    data <- structure(list(safe = c(
      1.5, 1.5, 2, 1, 1, 1, 2, 1.5, 1, 1, 1.5,
      1, 2, 1, 3.5, 1, 1, 1.5, 1.5, 1, 1, 1, 1, 2, 1.5, 1, 1.5, 1,
      1, 1.5, 2.5, 1, 2, 1, 1.5, 1, 1.5, 1, 1, 1, 2.5, 2, 1, 1, 1,
      2, 1, 1.5, 1.5, 1.5, 1.5, 3, 3, 1, 1.5, 2, 1, 1.5, 1, 1.5, 1,
      1.5, 1, 1, 1, 1.5, 1, 1, 1, 1.5, 1.5, 1.5, 1, 2, 1, 1, 1, 1,
      1, 1, 1.5, 1, 1, 2, 1, 1, 1, 1.5, 1, 1.5, 1.5, 1, 1, 1, 1, 1.5,
      1, 2, 1.5, 1, 1
    ), nhood.soccapi = c(
      1.8, 1.6, 2.8, 2.2, 2, 3.6,
      3.2, 1.8, 1.6, 1.8, 1.4, 3.8, 1.6, 1, 2, 2.2, 1, 1, 2, 1, 1,
      2.2, 1, 1.4, 1.8, 2, 2.4, 1.8, 1, 2, 2.2, 1.6, 2.2, 1.2, 2, 2,
      1, 2, 2, 2, 2, 1.8, 1.4, 1.6, 1, 2.2, 1.4, 1.6, 2.4, 1.2, 1.4,
      2.4, 2, 2.4, 2, 1.8, 1.6, 1, 1.2, 1.8, 2, 3.4, 2, 2, 1, 2, 1,
      1.4, 2, 2.8, 2, 1, 2, 1.2, 1.4, 2, 1.6, 1.4, 2.2, 2.4, 1.8, 1.4,
      1.4, 2, 2.2, 1, 2, 2.2, 1, 1.8, 2.2, 1.6, 1.2, 1, 2, 1, 1.2,
      2.6, 1.8, 1.8, 1.8
    ), DISORDER = structure(c(
      1L, 1L, 1L, 1L, 1L,
      1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, NA, NA, 2L, 1L, 2L, 1L, 1L,
      1L, 1L, 2L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, 2L, 1L, 1L, NA, 2L,
      NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, NA, 2L, 1L, 1L, 2L, 1L, NA,
      1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, NA, 1L, 1L, 1L,
      2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, NA, 1L,
      1L, 2L, 2L, 2L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 2L, NA, 2L
    ), .Label = c("0", "1"), class = "factor"), Scho_socialdep1000_3 = structure(c(
      3L,
      3L, 3L, 3L, 3L, 2L, NA, 3L, 2L, 1L, 2L, 1L, NA, 2L, NA, NA, 1L,
      2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 3L, 2L, NA, NA, 1L, 2L, 2L, 2L,
      1L, 1L, 2L, 3L, NA, 1L, 1L, 2L, NA, 2L, 2L, 1L, 2L, NA, 1L, 3L,
      2L, 1L, 1L, NA, 3L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L,
      2L, 1L, 1L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, NA, 1L, 1L,
      1L, NA, NA, 1L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 1L, 1L, 3L, 2L, 1L,
      NA, 2L, NA, 3L
    ), .Label = c("0", "1", "2"), class = "factor")), row.names = 100:200, class = "data.frame")
    
    
    
    dist <- daisy(data, metric = "gover")
    #> Warning in daisy(data, metric = "gover"): with mixed variables, metric "gower"
    #> is used automatically
    dendro <-
      dist %>%
      hclust(method = "average") %>%
      # need to put the elements into mutually exclusive sets of clusters
      cutree(k = 5)
    
    silhouette(dendro, dist) %>%
      plot()
    

    Please note that the choice of k is arbitrary. You will get different results selecting another value.

    Created on 2022-03-15 by the reprex package (v2.0.0)