r dendrogram hclust dendextend

Color branches of a dendrogram based on column in dataframe


I want to color the branches of a dendrogram based on the value in a column of a dataframe used in the hclust function.

Before you mark this question as a duplicate, as was done with an earlier question that in turn links to another one, note that this was never fully addressed in the answer there. It is easy to color branches based on the topology of the dendrogram, but I cannot figure out how to color branches based on a column in the dataframe that was used in the hclust function.

I've tried using the dendextend package in two very similar ways:

library(dendextend)
par(mar = c(2, 1, 0, 8)) # make sure the whole plot is on the page
hc <- hclust(dist(mtcars)) # cluster the dataframe based on distance
dend <- as.dendrogram(hc) # convert the clustering to a dendrogram
dend2 <- color_branches(dend, col = mtcars$cyl) # attempt but fail at coloring branches
plot(dend2, horiz = TRUE) # plot dendrogram

and

dend3 <- assign_values_to_leaves_edgePar(dend, value = mtcars$cyl, edgePar = "col") # attempt but fail at coloring branches
plot(dend3, horiz = TRUE) # plot dendrogram

Replacing mtcars$cyl with factor(mtcars$cyl) doesn't solve the problem either.

Both of these attempts produce a dendrogram that is not properly colored. It appears that the colors are applied from the bottom to the top of the dendrogram in the order of the values in the cyl column, but since the branches are no longer in that order, the coloring doesn't make any sense. I would prefer not to sort the dataframe as a way around this problem.

Thanks.


Solution

  • You need to put the colors in the order of the leaves of the dendrogram. You can use labels() to extract the names used on the leaves and index the rows of the dataframe by them:

    dend2 <- color_branches(dend, col=mtcars[labels(dend),"cyl"])
    

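    To see why indexing by labels() helps, here is a minimal sketch using the same mtcars data. It pulls the cyl values in leaf order and maps them to colors before passing them to color_branches. The palette and the names leaf_names, cyl_by_leaf, cyl_palette and leaf_colors are illustrative assumptions, not part of the original answer, and the coloring step only runs if dendextend is installed.

```r
# Cluster the built-in mtcars data, as in the question.
hc <- hclust(dist(mtcars))
dend <- as.dendrogram(hc)

# The leaves are arranged in the clustering order (hc$order),
# not in the row order of the data frame.
leaf_names <- labels(dend)

# Indexing the data frame by leaf label returns cyl in leaf order.
cyl_by_leaf <- mtcars[leaf_names, "cyl"]

# Illustrative palette (an assumption, not from the original answer):
# one named color per distinct cyl value.
cyl_palette <- c("4" = "steelblue", "6" = "darkorange", "8" = "firebrick")
leaf_colors <- unname(cyl_palette[as.character(cyl_by_leaf)])

# Color and plot only when dendextend is available.
if (requireNamespace("dendextend", quietly = TRUE)) {
  dend2 <- dendextend::color_branches(dend, col = leaf_colors)
  plot(dend2, horiz = TRUE)
}
```

    Mapping cyl to named colors is optional. Passing the raw numbers, as in the answer above, uses them as indices into the current palette(), which works but makes the colors harder to choose.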