Tags: r, ggplot2, tidyverse, boxplot, summary

ggplot2 - categorical boxplot with complex summary annotation


I'm trying to plot a categorical boxplot in R using tidyverse & ggplot2, with x = categorical column, y = continuous column, and additional information at the top of each value of x.

In each value for x, there are a number of outliers that are below a certain fixed threshold value V. The required calculation I have for this is something like:

df %>%
  select(x, y) %>%
  # mutate() rather than summarise() here, so that y is still available below
  group_by(x) %>% mutate(n_per_grp = n()) %>% ungroup() %>%
  mutate(below_thresh = if_else(y < V, 1, 0)) %>%
  filter(below_thresh == 1) %>%
  group_by(x, n_per_grp) %>%
  summarise(n_below_thresh = n()) %>% ungroup() %>%
  mutate(perc_below_thresh = round(n_below_thresh/n_per_grp*100, 3)) %>%
  mutate(final_lbl = paste0(perc_below_thresh, "% (", n_below_thresh, "/", n_per_grp, ")"))

I'm aware of stat_summary(fun.data = myfunc) to plot metrics like mean(), median(), length(), etc. However, I cannot figure out how to use geom_boxplot() & stat_summary() to annotate each boxplot with final_lbl. I'm not sure whether stat_summary() is even the right thing to use.
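
For context, here is a minimal sketch of the stat_summary(fun.data = ...) pattern mentioned above. It assumes the built-in mtcars data as a stand-in for df, with cyl playing the role of x and mpg the role of y. The helper name count_label() is hypothetical. It only shows how a per-group text label such as a count can be drawn, not the threshold label itself.

```r
library(ggplot2)

# Hypothetical helper (not from the question): for the y values of one group,
# return the y position and the text label that geom "text" will draw.
count_label <- function(y) {
  data.frame(y = max(y), label = paste0("n = ", length(y)))
}

# mtcars stands in for df; factor(cyl) is the categorical x, mpg the continuous y.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  stat_summary(fun.data = count_label, geom = "text", vjust = -0.5)

print(p)
```

stat_summary() calls the function once per x value, so the function only sees that group's y values. Anything that needs information from outside the group has to be passed in some other way.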

Any help in understanding this would be greatly appreciated!


Solution

  • I figured out the answer to my own question, based on this post. I had been questioning whether I really needed stat_summary() for what I was trying to do, and had the idea to compute my per-group annotation label outside the main plot code.

    Here's a reproducible example:

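    library(tidyverse)  # tibble, dplyr, ggplot2; ids::uuid() below also needs the ids package
    set.seed(42)        # arbitrary seed so the random draws are repeatable
    
    n_categ = 10
    n_ppl = 1000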
    
    df = tibble(col_x = rep(LETTERS[1:n_categ], n_ppl)) %>% 
      arrange(col_x) %>% 
      mutate(col_ppl = rep(ids::uuid(n = n_ppl), n_categ)) %>% 
      arrange(col_x, col_ppl) %>% 
      mutate(
        col_y_1 = rbeta(n = n_categ*n_ppl, shape1 = 60, shape2 = 120),
        col_y_2 = rnorm(n = n_categ*n_ppl, mean = 100, sd = 25),
        col_y_3 = runif(n = n_categ*n_ppl, min = -100, max = 100),
        col_y_4 = rbinom(n = n_categ*n_ppl, size = 1, prob = 0.5),
        ) %>% 
      mutate(col_y = (col_y_1 * col_y_2) + (col_y_3 * col_y_4)) %>% 
      mutate(col_y = if_else(col_y < 0, 0, col_y)) %>% 
      select(-starts_with("col_y_"))
    
    df_annot = df %>% 
      mutate(ypos = round(max(col_y) * 1.05, 0)) %>% 
      group_by(col_x, ypos) %>% mutate(n_per_grp = n()) %>% ungroup() %>% 
      mutate(below_15 = if_else(col_y < 15, 1, 0)) %>% 
      filter(below_15 == 1) %>% 
      group_by(col_x, ypos) %>% mutate(n_below_15 = n()) %>% ungroup() %>% 
      mutate(perc_below_15 = n_below_15/n_per_grp*100) %>% 
      group_by(col_x, ypos) %>% 
      summarise(final_lbl = max(paste0(perc_below_15, "% (", n_below_15, "/", n_per_grp, ")")))
    
    df %>% 
      ggplot(aes(x = col_x, y = col_y)) +
        geom_boxplot(outlier.size = 0.5) +
        geom_hline(yintercept = 15, linetype = "dashed") +
        geom_text(aes(y = ypos, label = final_lbl), data = df_annot, hjust = 0) +
        scale_y_continuous(breaks = c(0, 15, 50, 100, 150, 200), minor_breaks = NULL) +
        xlab("Category") + ylab("Measure") +
        expand_limits(y = c(0, 200)) + coord_flip()
    

    This gives me the plot I was looking for:

    [Figure: resulting boxplot with categorical labels]

    Thanks for all your help though!
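
The label table can probably be built more compactly with a single summarise() call. The sketch below is one possible variant, not the code from the answer: df_toy, thresh, and df_annot_toy are hypothetical names, and the toy values are made up for illustration. Counting with sum(col_y < thresh) instead of filtering also keeps categories that have no values below the threshold, which the filter-based version would drop.

```r
library(dplyr)

# Toy stand-in for df (hypothetical values, not the data from the answer).
df_toy <- tibble(
  col_x = c("A", "A", "A", "A", "B", "B"),
  col_y = c(5, 10, 20, 40, 30, 50)
)

thresh <- 15  # same threshold as the dashed line in the plot

df_annot_toy <- df_toy %>%
  mutate(ypos = round(max(col_y) * 1.05, 0)) %>%  # one shared label position
  group_by(col_x, ypos) %>%
  summarise(
    n_per_grp = n(),                # rows in this category
    n_below = sum(col_y < thresh),  # 0 when nothing is below the threshold
    .groups = "drop"
  ) %>%
  mutate(
    perc_below = round(n_below / n_per_grp * 100, 1),
    final_lbl = paste0(perc_below, "% (", n_below, "/", n_per_grp, ")")
  )

print(df_annot_toy)
```

The resulting table has one row per category and can be passed to geom_text(data = ...) in the same way as df_annot.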