I'm trying to plot a categorical boxplot in R using tidyverse & ggplot2, with x = categorical column, y = continuous column, and additional information at the top of each value of x.
In each value for x, there are a number of outliers that are below a certain fixed threshold value V. The required calculation I have for this is something like:
df %>%
  select(x, y) %>%
  # mutate() rather than summarise() here, so that y is still available below
  group_by(x) %>% mutate(n_per_grp = n()) %>% ungroup() %>%
  mutate(below_thresh = if_else(y < V, 1, 0)) %>%
  filter(below_thresh == 1) %>%
  group_by(x, n_per_grp) %>%
  summarise(n_below_thresh = n()) %>% ungroup() %>%
  mutate(perc_below_thresh = round(n_below_thresh/n_per_grp*100, 3)) %>%
  mutate(final_lbl = paste0(perc_below_thresh, "% (", n_below_thresh, "/", n_per_grp, ")"))
I'm aware of stat_summary(fun.data = myfunc) to plot metrics like mean(), median(), length(), etc. However, I cannot figure out how to use geom_boxplot() & stat_summary() to annotate each boxplot with final_lbl. I'm not sure whether stat_summary() is even the right thing to use.
Any help in understanding this would be greatly appreciated!
Figured out the answer to my own question, based on this post. I was questioning whether I really needed stat_summary() for what I was trying to do, and had the idea to "calculate" my per-group annotation label outside the main plot code.
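A minimal sketch of that idea (building the label in its own small data frame, one row per category) might look like the following. The toy data, the threshold of 15, and the names toy, thresh and toy_annot are made up for illustration. It counts with sum() inside a single summarise() instead of filtering first.

```r
library(dplyr)

# Hypothetical toy data: two categories, four values each
toy <- data.frame(
  col_x = rep(c("A", "B"), each = 4),
  col_y = c(5, 10, 20, 30, 12, 40, 50, 60)
)
thresh <- 15  # assumed fixed threshold

# One row per category: group size, count below the threshold, label text
toy_annot <- toy %>%
  group_by(col_x) %>%
  summarise(
    n_per_grp = n(),
    n_below = sum(col_y < thresh),
    .groups = "drop"
  ) %>%
  mutate(
    perc_below = round(n_below / n_per_grp * 100, 3),
    final_lbl = paste0(perc_below, "% (", n_below, "/", n_per_grp, ")")
  )

toy_annot$final_lbl
# "50% (2/4)" "25% (1/4)"
```

Because nothing is filtered out, a category with no values below the threshold would still get a row, with a label of the form "0% (0/n)".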
Here's a reproducible example:
library(tidyverse)  # tibble, dplyr and ggplot2; the ids package must also be installed

# Simulated data: n_categ categories, each with the same n_ppl people
n_categ = 10
n_ppl = 1000
df = tibble(col_x = rep(LETTERS[1:n_categ], n_ppl)) %>%
  arrange(col_x) %>%
  mutate(col_ppl = rep(ids::uuid(n = n_ppl), n_categ)) %>%
  arrange(col_x, col_ppl) %>%
  mutate(
    col_y_1 = rbeta(n = n_categ*n_ppl, shape1 = 60, shape2 = 120),
    col_y_2 = rnorm(n = n_categ*n_ppl, mean = 100, sd = 25),
    col_y_3 = runif(n = n_categ*n_ppl, min = -100, max = 100),
    col_y_4 = rbinom(n = n_categ*n_ppl, size = 1, prob = 0.5),
  ) %>%
  mutate(col_y = (col_y_1 * col_y_2) + (col_y_3 * col_y_4)) %>%
  mutate(col_y = if_else(col_y < 0, 0, col_y)) %>%  # floor negative values at 0
  select(-starts_with("col_y_"))
# Per-category annotation: share of values below 15, positioned just past the largest col_y
df_annot = df %>%
  mutate(ypos = round(max(col_y) * 1.05, 0)) %>%
  group_by(col_x, ypos) %>% mutate(n_per_grp = n()) %>% ungroup() %>%
  mutate(below_15 = if_else(col_y < 15, 1, 0)) %>%
  filter(below_15 == 1) %>%
  group_by(col_x, ypos) %>% mutate(n_below_15 = n()) %>% ungroup() %>%
  mutate(perc_below_15 = n_below_15/n_per_grp*100) %>%
  group_by(col_x, ypos) %>%
  # every row in a group carries the same label, so max() just collapses it to one row
  summarise(final_lbl = max(paste0(perc_below_15, "% (", n_below_15, "/", n_per_grp, ")")))
df %>%
  ggplot(aes(x = col_x, y = col_y)) +
  geom_boxplot(outlier.size = 0.5) +
  geom_hline(yintercept = 15, linetype = "dashed") +
  geom_text(aes(y = ypos, label = final_lbl), data = df_annot, hjust = 0) +
  scale_y_continuous(breaks = c(0, 15, 50, 100, 150, 200), minor_breaks = FALSE) +
  xlab("Category") + ylab("Measure") +
  expand_limits(y = c(0, 200)) + coord_flip()
This gives me the plot I was looking for:
[Image: Resulting boxplot with categorical labels]
Thanks for all your help though!
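For comparison, stat_summary() can also draw this kind of label, provided the function passed to fun.data returns a data frame with a y position and a label column and the layer uses geom = "text". The following is only a sketch under assumptions: the toy data, the threshold of 15, and the helper name below_thresh_label are hypothetical and not part of the example above.

```r
library(ggplot2)

# Hypothetical toy data and threshold, for illustration only
toy <- data.frame(
  col_x = rep(c("A", "B"), each = 4),
  col_y = c(5, 10, 20, 30, 12, 40, 50, 60)
)
thresh <- 15
ypos <- max(toy$col_y) * 1.05  # shared label position just past the largest value

# stat_summary() calls this once per category with that category's y values;
# it returns where to put the text (y) and what the text says (label)
below_thresh_label <- function(y) {
  n_below <- sum(y < thresh)
  perc <- round(n_below / length(y) * 100, 3)
  data.frame(
    y = ypos,
    label = paste0(perc, "% (", n_below, "/", length(y), ")")
  )
}

p <- ggplot(toy, aes(x = col_x, y = col_y)) +
  geom_boxplot(outlier.size = 0.5) +
  geom_hline(yintercept = thresh, linetype = "dashed") +
  stat_summary(fun.data = below_thresh_label, geom = "text", hjust = 0) +
  coord_flip()

p
```

The trade-off is that the label logic lives inside the plotting call, whereas a separate annotation data frame can be inspected and checked before plotting.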