r classification statistics-bootstrap stability

How often does each pair appear together over a large number of cluster solutions?


To evaluate the stability of a classification/clustering solution, I am running the algorithm on 1,000 bootstrap samples of my data. Across these solutions, I would like to count how often each pair of observations falls in the SAME cluster. I am clustering about 250 observations, which gives choose(250, 2) = 31,125 (about 31k) such pairs.

This code generates a synthetic data set with three cluster solutions:

set.seed(1)
ID <- paste0("ID", 1:250)
# Three example solutions: a random cluster label (1 to 5) for each observation
cluster1 <- sample(1:5, 250, replace = TRUE)
cluster2 <- sample(1:5, 250, replace = TRUE)
cluster3 <- sample(1:5, 250, replace = TRUE)

df <- data.frame(ID, cluster1, cluster2, cluster3)

You will see that ID3 and ID4 appear in the same cluster twice.

As with all classifications, the integer that denotes cluster membership is arbitrary, so labels cannot be compared directly across solutions.
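
A minimal sketch of why the arbitrary labels do not get in the way of pair counting: comparing every label with every other label inside a single solution gives a logical co-membership matrix, and that matrix is unchanged when the clusters are renumbered. The helper name `comembership` and the two toy vectors are made up for illustration.

```r
# Co-membership matrix for one solution: TRUE where i and j share a cluster.
# `comembership` is a hypothetical helper name, not part of any package.
comembership <- function(labels) outer(labels, labels, "==")

a <- c(1, 1, 2, 3, 3)   # one labelling of five observations
b <- c(4, 4, 1, 2, 2)   # the same partition with different integers

# The two labellings describe the same pairs, so the matrices are identical.
same <- identical(comembership(a), comembership(b))
same
```

Because only within-solution comparisons are used, no matching of cluster labels between solutions is needed before counting.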


Solution

  • Since my problem isn't too large, I used plain nested loops, the kind of code I would just as easily write in C.

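    set.seed(1)

    # pairs.matrix[i, j] (for i < j) counts the solutions in which i and j share a cluster
    pairs.matrix <- matrix(0, 250, 250)
    for (s in 1:1000) {
      # Stand-in for one bootstrap solution; replace with the real cluster labels
      cluster <- sample(1:5, 250, replace = TRUE)
      for (i in 1:(length(cluster) - 1))
        for (j in (i + 1):length(cluster))
          if (cluster[i] == cluster[j]) pairs.matrix[i, j] <- pairs.matrix[i, j] + 1
    }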
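
The double loop can also be vectorised. The following is a sketch under stated assumptions rather than the method used above: it assumes the solutions are stored as the columns of a matrix (the names `sols`, `n`, `B` and `co` are chosen here for illustration, and the random labels stand in for real bootstrap output), and it adds up one `outer()` comparison per solution.

```r
set.seed(1)
n <- 250    # observations
B <- 1000   # cluster solutions

# Hypothetical stand-in for real bootstrap output: one column per solution.
sols <- replicate(B, sample(1:5, n, replace = TRUE))

# Sum the logical co-membership matrices over all solutions.
co <- matrix(0L, n, n)
for (s in seq_len(B)) {
  co <- co + outer(sols[, s], sols[, s], "==")
}

# Keep only i < j, as in the loop version, and label rows and columns by ID.
co[lower.tri(co, diag = TRUE)] <- 0L
dimnames(co) <- list(paste0("ID", 1:n), paste0("ID", 1:n))

# Share of solutions in which ID3 and ID4 fall in the same cluster.
co["ID3", "ID4"] / B
```

Dividing the counts by `B` turns them into co-occurrence proportions, which are easier to compare when the number of solutions changes.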