To evaluate the stability of a classification/clustering solution, I am running 1,000 bootstraps of the algorithm on my data. Across these classification outcomes I would like to count how often each pair of observations occurs in the same cluster. I am clustering about 250 observations, which gives about 31k such pairs (choose(250, 2) = 31,125 for exactly 250).
This R code generates a synthetic data set with three clustering runs:
set.seed(1)
ID <- paste0("ID", 1:250)
cluster1 <- sample(1:5, 250, replace=TRUE)
cluster2 <- sample(1:5, 250, replace=TRUE)
cluster3 <- sample(1:5, 250, replace=TRUE)
df <- data.frame(ID, cluster1, cluster2, cluster3)
You will see that ID3 and ID4 appear in the same cluster twice.
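A minimal sketch of how one pair could be checked by hand. The helper name co_count is hypothetical and not part of the original code. The result depends on the R version, because the default sampling algorithm changed in R 3.6.0, so no expected output is shown.

```r
set.seed(1)
ID <- paste0("ID", 1:250)
df <- data.frame(ID,
                 cluster1 = sample(1:5, 250, replace = TRUE),
                 cluster2 = sample(1:5, 250, replace = TRUE),
                 cluster3 = sample(1:5, 250, replace = TRUE))

# Hypothetical helper: number of runs in which two IDs share a cluster label
co_count <- function(df, id_a, id_b) {
  labels <- as.matrix(df[, -1])   # drop the ID column, keep the label columns
  sum(labels[df$ID == id_a, ] == labels[df$ID == id_b, ])
}

co_count(df, "ID3", "ID4")
```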
As with all classifications, the integer used to denote cluster membership is arbitrary, so labels cannot be compared directly across runs; only co-membership can.
Since my problem isn't too large, I used a brute-force loop of the kind I would write in C:
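To illustrate why pairs are the natural unit here, this small sketch uses two made-up label vectors (a and b are illustrative, not from my data) that describe the same partition with permuted labels. Their co-membership matrices are identical even though the labels differ.

```r
a <- c(1, 1, 2, 3, 3)
b <- c(3, 3, 1, 2, 2)   # same grouping as a, labels permuted

# outer() with "==" yields TRUE where two observations share a label
same_a <- outer(a, a, "==")
same_b <- outer(b, b, "==")

identical(same_a, same_b)   # TRUE
```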
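set.seed(1)
pairs.matrix <- matrix(0, 250, 250)
for (s in 1:1000) {
  # stand-in for one bootstrap run of the clustering algorithm
  cluster <- sample(1:5, 250, replace = TRUE)
  n <- length(cluster)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # count pair (i, j) when both share a cluster; only the upper triangle is filled
      if (cluster[i] == cluster[j]) pairs.matrix[i, j] <- pairs.matrix[i, j] + 1
    }
  }
}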
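One possible vectorized variant of the double loop above, offered as a sketch rather than a definitive solution: outer() builds the full co-membership matrix for a run in one call, and the lower triangle is zeroed at the end to match the loop's output. The function name count_pairs and the n x B layout of clusterings (one column per run) are assumptions made for this example.

```r
# Hypothetical helper: `clusterings` is an n x B matrix, one column per run
count_pairs <- function(clusterings) {
  n <- nrow(clusterings)
  counts <- matrix(0L, n, n)
  for (b in seq_len(ncol(clusterings))) {
    lab <- clusterings[, b]
    # logical n x n matrix, TRUE where observations i and j share a label
    counts <- counts + outer(lab, lab, "==")
  }
  # keep only pairs with i < j, as in the loop version
  counts[lower.tri(counts, diag = TRUE)] <- 0L
  counts
}

set.seed(1)
# random labels stand in for the 1,000 bootstrap clusterings
clusterings <- replicate(1000, sample(1:5, 250, replace = TRUE))
pairs.matrix <- count_pairs(clusterings)
```

This still loops over the runs, but each run costs one vectorized comparison instead of about 31k scalar ones.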