Counting the number of specific elements in a pruned dendrogram leaf


I am doing a cluster analysis and I want to count the number of occurrences of a certain label in each leaf of a pruned tree. Below is a simplified example where the pruned tree has only three branches. I now want to know the number of As and Bs in each of the three different branches/leaves. How can I get those?

rm(list=ls(all=TRUE))
mylabels        <- matrix(nrow=1, ncol = 20)
mylabels[1,1:10]  <- ("A")
mylabels[1,11:20] <- ("B")
# 20 x 100 matrix; draw 20 * 100 values so that no values are recycled
myclusterdata   <- matrix(rexp(20 * 100, rate=.1), ncol=100, nrow=20)

rownames(myclusterdata)<-mylabels
hc <- hclust(dist(myclusterdata), "ave")
memb <- cutree(hc, k = 3)
cent <- NULL
for(k in 1:3){
  cent <- rbind(cent, colMeans(myclusterdata[memb == k, , drop = FALSE]))
}

hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
# whole tree
plot(as.dendrogram(hc), horiz=TRUE)
# pruned tree (only 3 branches)
plot(as.dendrogram(hc1), horiz=TRUE)

Solution

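  • OK, I figured it out. The leaf each observation belongs to is stored in memb, the vector returned by cutree, and the names of memb are the row labels (A or B). Counting how often each label occurs together with each leaf number gives the result. Below is the code for the example: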

    rm(list=ls(all=TRUE))
    mylabels        <- matrix(nrow=1, ncol = 20)
    mylabels[1,1:10]  <- ("A")
    mylabels[1,11:20] <- ("B")
    # 20 x 100 matrix; draw 20 * 100 values so that no values are recycled
    myclusterdata   <- matrix(rexp(20 * 100, rate=.1), ncol=100, nrow=20)
    
    rownames(myclusterdata)<-mylabels
    hc <- hclust(dist(myclusterdata), "ave")
    memb <- cutree(hc, k = 3)
    
    cent <- NULL
    for(k in 1:3){
      cent <- rbind(cent, colMeans(myclusterdata[memb == k, , drop = FALSE]))
    }
    
    hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
    # whole tree
    plot(as.dendrogram(hc), horiz=TRUE)
    # pruned tree (only 3 branches)
    plot(as.dendrogram(hc1), horiz=TRUE)
    
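    # count the As and Bs in each leaf of the pruned tree
    var_of_interest <- levels(as.factor(names(memb)))   # the labels: "A", "B"
    leaf_number <- levels(as.factor(memb))              # the leaves: "1", "2", "3"
    
    # rows = leaves, columns = labels
    counter <- matrix(nrow=length(leaf_number), ncol=length(var_of_interest),
                      dimnames=list(leaf_number, var_of_interest))
    for (i in seq_along(leaf_number)) {
       for (j in seq_along(var_of_interest)) {
          # number of observations with label j that fall into leaf i
          counter[i,j] <- sum(names(memb)==var_of_interest[j] & memb==leaf_number[i])
       }
    }
    counter
    # fraction of Bs in each leaf
    counter[,2]/(counter[,1]+counter[,2])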
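
The double loop above builds a contingency table by hand. As a possible shortcut, base R's table() can cross-tabulate the leaf numbers against the labels in one call, and prop.table() turns the counts into per-leaf shares. The following is only a minimal sketch under a few assumptions that are not in the original post: a fixed seed via set.seed() for reproducibility, and the illustrative names x, labels, counts and shares.

```r
# Minimal sketch: cross-tabulate leaf membership against the row labels.
set.seed(42)  # assumption: fixed seed so the example is reproducible

# 10 observations labelled "A" followed by 10 labelled "B"
labels <- rep(c("A", "B"), each = 10)
x <- matrix(rexp(20 * 100, rate = 0.1), nrow = 20, ncol = 100,
            dimnames = list(labels, NULL))

hc <- hclust(dist(x), method = "average")
memb <- cutree(hc, k = 3)  # named vector: names are the row labels

# Rows are the leaves of the pruned tree, columns are the labels.
counts <- table(leaf = memb, label = names(memb))
print(counts)

# Row-wise proportions: share of each label within each leaf.
shares <- prop.table(counts, margin = 1)
print(shares)
```

Here counts[, "B"] gives the number of Bs per leaf and shares[, "B"] the corresponding fraction, which should match the last line of the loop-based version when run on the same data.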