Search code examples
rhierarchical-clusteringstatistics-bootstrapkolmogorov-smirnov

Uncertainty in Clustering


I am applying hierarchical clustering to my data set which includes 30 studies. An example of my data set is:

   X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
1  2  2  7  7  0  0  0  0  0  0  0   0   0   0   0
2  0  5  37 27 5  1  2  2  2  2  1   1   1   0   0
                      :
                      :
30 0  0  3  1  2  5  7  0  0  0  0   0   0   0   0

I used the following code to apply a bootstrap sampling version of the kolmogorov-sminrov test to calculate the distance matrix d and applied the 'complete-link' algorithm.

p <- outer(1:30, 1:30, Vectorize(function(i,j)
  {ks.boot(as.numeric(rep(seq(0,14,1),as.vector(test[i,]))),
           as.numeric(rep(seq(0,14,1),as.vector(test[j,]))),nboots=10000)
              $ks.boot.pvalue}))
d <- as.dist(as.matrix(1-p))

hc1 <- hclust(d,method = "complete")
plot(hc1)

This samples 10,000 (KS) p-values between each study. So for s1 & s2, s1 & s3 .... s1 & s30, s2 & s3 .... s 29 & s30 and stores the probabilities into a 30 x 30 matrix.

If I repeat this process by simply rerunning the code and store the p-values in another variable and plot a dendrogram then I'll obtain a slightly different dendrogram with some studies altering position. I've attached a few examples

Some of the differences are very subtle to visualise but the height changes slightly and the position of the big clusters change too. I'm interested in two types of uncertainties: uncertainties due to bootstrap sampling which is what the dendrograms try to show.

The 2nd is uncertainty due to sample size, i.e., how does sample size in a study effect the clustering orders. I want to somehow visualise this and my only guess is to remove a study and compare the new dendrogram with the original and find the differences manually which would take a lot of time.

I've also checked the pvclust package for hierarchical clustering but I don't think its applicable when I'm using KS bootstrapping.

D1

D2

D3


Solution

  • There are lots of ways you could take this analysis. You are calculating a single distance matrix based on bootstrapped data. Instead, you should be generating a bootstrapped tree with bootstrap branch support. This will give you an idea of how robust the clustering is.

    Here's an example using the Iris dataset and this R package: https://github.com/sgibb/bootstrap

    library(bootstrap)
    library(dplyr)
    
    set.seed(1)
    data(iris)
    rownames(iris) <- paste0(iris$Species, ".", 1:nrow(iris))
    iris <- iris %>% sample_n(25) %>% dplyr::select(-Species) %>% data.matrix
    
    createHclustObject <- function(x)hclust(dist(x), "ave")
    b <- bootstrap(iris, fun=createHclustObject, n=1000L)
    
    hc <- createHclustObject(iris)
    plot(hc)
    bootlabels.hclust(hc, b, col="blue")
    

    enter image description here

    See also:

    http://www.pnas.org/content/93/23/13429 (Original?) PNAS paper describing bootstrap branch support for phylogenetic trees