Tags: r, plot, dendrogram, dendextend

Make dendextend assign colors to branches where I preset colors for leaves


I want to color the branches of my dendrogram according to manually assigned groups of leaves. I know in advance that, for example, leaves A-C should be red, and every branch that leads only to red leaves should be red as well.

I can color the branches of my dendrogram with the "dendextend" package, but I have no control over which color is assigned to which cluster ID. dendextend assigns the first color to the first cluster ID it finds, regardless of whether that is ID 1. However, I need ID 1 drawn in color 1, and so on, because I need a legend.

See this example. I want a dendrogram which colors the labels and branches A-C in red, D-F in blue and G-I in green.

suppressPackageStartupMessages(library(dendextend))
library(dplyr)

set.seed(12346)
# Sample data: 
# ------------
# l = Leaf labels | g = assigned color of leaf | x = value for clustering
dat <- tibble(l = LETTERS[1:9],
              g = factor(rep(letters[1:3], each = 3)),
              x = round(runif(9,0,10)))

# color_branches() needs integer cluster IDs
dat$gi <- dat$g %>% as.integer()

# Color IDs of each group
dat %>% distinct(g, gi)
## # A tibble: 3 x 2
##   g        gi
##   <fct> <int>
## 1 a         1
## 2 b         2
## 3 c         3
# ID 1 = red, ID 2 = blue, ID 3 = green
clucols <- c("red", "blue", "green")

# Clustering & Dendrogram
# -----------------------
dst <- dist(setNames(dat$x, dat$l))
den <- as.dendrogram(hclust(dst))
o <- order.dendrogram(den)

den <- den %>%
  color_branches(col = clucols, clusters = dat$gi[o]) 
# Transfer branch colors to labels
labels_colors(den) <- get_leaves_branches_col(den)

plot(den)

# Legend
dat %>% distinct(g, gi) %>%
{legend("topright", legend = .$g, col = clucols[.$gi], lty = 1)}

Result:

The leaves are not colored in the order I want, but by cluster position on the plot from left to right:

[Figure: Dendrogram with wrong coloring]

If you change the set.seed(...) line to set.seed(12345), you see that the coloring seems correct. But this is because the clusters appear in correct order by chance, if seen from left to right:

[Figure: Dendrogram with correct coloring]
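
One way to check this is to print the preset cluster IDs in plotting order. The sketch below rebuilds the question's toy data with base R only (the helper name ids_in_plot_order is made up for illustration). It assumes, as described above, that color_branches() hands out colors in the order in which the IDs are first met from left to right.

```r
# Hypothetical helper: preset cluster IDs of the leaves, left to right on the plot.
# Assumption: same toy data as in the question, rebuilt without dplyr/dendextend.
ids_in_plot_order <- function(seed) {
  set.seed(seed)
  labs <- LETTERS[1:9]                 # leaf labels
  gi   <- rep(1:3, each = 3)           # preset group ID per leaf
  x    <- round(runif(9, 0, 10))       # values used for clustering
  den  <- as.dendrogram(hclust(dist(setNames(x, labs))))
  o    <- order.dendrogram(den)        # original indices in plotting order
  setNames(gi[o], labs[o])
}

ids <- ids_in_plot_order(12346)
ids           # cluster ID of every leaf, left to right
unique(ids)   # order in which the IDs are first encountered
```

If unique(ids) is not 1, 2, 3, the first color in clucols would go to a cluster other than ID 1, which matches the behavior reported here.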

How do I make color_branches() assign colors by cluster ID, not by which cluster comes first?



Solution

  • A workaround is to use the function branches_attr_by_labels to assign the color to branches for each group separately.

    Replace this code in the question:

    den <- den %>%
      color_branches(col = clucols, clusters = dat$gi[o]) 
    

    with the code below.

    You need a list with one element per group. Each element holds the group ID (which indexes clucols) and the labels of the leaves to color. You can build it like this, for example:

    library(purrr)
    colmap <- dat %>% group_by(g) %>% summarise(l = list(l)) %>% transpose()
    colmap
    
    ## [[1]]
    ## [[1]]$g
    ## [1] 1
    ## 
    ## [[1]]$l
    ## [1] "A" "B" "C"
    ## 
    ## 
    ## [[2]]
    ## [[2]]$g
    ## [1] 2
    ## 
    ## [[2]]$l
    ## [1] "D" "E" "F"
    ## 
    ## 
    ## [[3]]
    ## [[3]]$g
    ## [1] 3
    ## 
    ## [[3]]$l
    ## [1] "G" "H" "I"
    

    Then apply branches_attr_by_labels for each element. Because the function takes a dendrogram plus per-group arguments and returns a dendrogram, you can chain the calls with purrr::reduce or base::Reduce:

    den <- reduce(.x = colmap, .init = den, .f = function(d, m) 
      branches_attr_by_labels(d, m$l, clucols[m$g] ))
    

    Alternatively, slightly longer:

    for(e in colmap){
      den <- branches_attr_by_labels(den, e$l, clucols[e$g])
    }
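
    The base::Reduce variant is mentioned but not shown. A minimal sketch of it follows, assuming the same toy data as in the question and building colmap with lapply instead of dplyr/purrr (that construction is an illustration, not part of the original answer):

    ```r
    library(dendextend)

    # Same toy data as in the question, rebuilt with base R
    set.seed(12346)
    labs    <- LETTERS[1:9]
    gi      <- rep(1:3, each = 3)              # preset group ID per leaf
    x       <- round(runif(9, 0, 10))
    clucols <- c("red", "blue", "green")       # ID 1 = red, 2 = blue, 3 = green

    den <- as.dendrogram(hclust(dist(setNames(x, labs))))

    # One list element per group: its ID (g) and its leaf labels (l)
    colmap <- lapply(sort(unique(gi)), function(i) list(g = i, l = labs[gi == i]))

    # Fold the dendrogram through all groups; `den` is the initial value
    den <- Reduce(
      function(d, m) branches_attr_by_labels(d, m$l, clucols[m$g]),
      colmap,
      den
    )
    ```

    The labels_colors() and plot() lines from the question can follow unchanged.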
    

    Result for set.seed(12346), as used in the question. Compare with the first picture above:

    [Figure: Dendrogram with correct coloring]
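
    Since the legend is what makes the ID-to-color mapping matter, a possible variation is to key the colors by group name rather than by position. This is only a sketch: the named vector and the leaf_cols variable are assumptions for illustration, not part of the code above.

    ```r
    # Assumption: colors keyed by the group names used in the question
    clucols <- c(a = "red", b = "blue", c = "green")
    g <- factor(rep(letters[1:3], each = 3))   # preset group per leaf (A..I)

    # Look colors up by name, not by integer code, so that reordering the
    # factor levels cannot silently shift the colors
    leaf_cols <- unname(clucols[as.character(g)])
    leaf_cols
    ```

    With such a vector, the legend could be drawn from the same object, for example legend("topright", legend = names(clucols), col = clucols, lty = 1), so the plot and the legend cannot drift apart.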