I want the boxplot whiskers to extend to the median of the data +/- 1.96 * the standard deviation (roughly the central 95% of the data under a normal approximation). I did this by calculating the boxplot statistics with aggregate and passing them in as the minimum, lower quartile, median, and so on. How can I make the width of each box proportional to the square root of its number of observations (as ggplot does with varwidth = TRUE)? Everything I have tried so far (setting weight, width) changes the width of all categories equally. Thank you.
rm(list = ls())
library(ggplot2)
set.seed(1)
residuals <- runif(n=1000, min=-3, max=3)
category <- c('A','A','A','B','B','C','D','E','E','F')
df1 <- data.frame(category,residuals)
boxplot_stats <- aggregate(residuals ~ category, df1, function(x) {
  median_val = median(x)
  z_score = 1.96
  min_quantile = median_val - z_score * sd(x)
  lower_quantile = quantile(x, c(0.25))
  upper_quantile = quantile(x, c(0.75))
  max_quantile = median_val + z_score * sd(x)
  n_obs_sqrt = sqrt(length(x))
  c(min_quantile, lower_quantile, median_val, upper_quantile, max_quantile, n_obs_sqrt)
})
custom_boxplot <- ggplot(boxplot_stats, aes(x=category))+
  geom_boxplot(aes(ymin = residuals[, 1], lower = residuals[, 2],
                   middle = residuals[, 3], upper = residuals[, 4],
                   ymax = residuals[, 5]),
               stat = "identity", color = "black", fill = "lightblue") +
  labs(title = "boxplot", x = "Category", y = "Residuals") +
  theme_bw()
print(custom_boxplot)
Although it's not necessary (see footnote), the more idiomatic way to do this would be to use summarize instead of aggregate, so that you have distinct column names rather than matrix columns to index inside ggplot:
library(tidyverse)
boxplot_stats <- df1 %>%
  group_by(category) %>%
  summarise(median_val = median(residuals),
            min_quantile = median_val - 1.96 * sd(residuals),
            lower_quantile = quantile(residuals, c(0.25)),
            upper_quantile = quantile(residuals, c(0.75)),
            max_quantile = median_val + 1.96 * sd(residuals),
            n_obs_sqrt = sqrt(n()))
boxplot_stats
#> # A tibble: 6 x 7
#> category median_val min_quantile lower_quantile upper_quantile max_quantile
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.0247 -3.44 -1.34 1.66 3.49
#> 2 B -0.166 -3.43 -1.43 1.38 3.10
#> 3 C -0.445 -3.59 -1.67 0.905 2.70
#> 4 D -0.443 -3.70 -1.84 0.971 2.81
#> 5 E -0.0109 -3.44 -1.35 1.53 3.41
#> 6 F 0.739 -2.82 -0.985 1.84 4.30
#> # i 1 more variable: n_obs_sqrt <dbl>
Because width is not an aesthetic mapping, you need to pass the actual desired widths yourself, as a vector with one value per box. None of these should be larger than 1 (otherwise the boxes will overlap), so the correct expression for the widths is boxplot_stats$n_obs_sqrt/max(boxplot_stats$n_obs_sqrt), which gives the largest group a width of 1 and scales the others down relative to it. You also need to specify position = position_dodge() to get the boxes to line up correctly:
ggplot(boxplot_stats, aes(x = category)) +
  geom_boxplot(aes(ymin = min_quantile, lower = lower_quantile,
                   middle = median_val, upper = upper_quantile,
                   ymax = max_quantile, weight = n_obs_sqrt),
               width = boxplot_stats$n_obs_sqrt/max(boxplot_stats$n_obs_sqrt),
               stat = "identity",
               color = "black", fill = "lightblue",
               position = position_dodge()) +
  labs(title = "boxplot", x = "Category", y = "Residuals") +
  theme_bw()
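To see what the width expression does on its own, here is a minimal base-R sketch that rebuilds only the category column from the question (the ten labels recycled over 1000 rows) and computes the relative widths. The names n_obs and rel_width are introduced purely for illustration and do not appear in the code above.

```r
# Rebuild the question's category column: 10 labels recycled to 1000 rows
category <- rep(c('A','A','A','B','B','C','D','E','E','F'), length.out = 1000)

# Observations per category as a named vector (A = 300, B = 200, C = 100, ...)
n_obs <- c(table(category))

# Square-root scaling, then divide by the maximum so the widest box has width 1
n_obs_sqrt <- sqrt(n_obs)
rel_width <- n_obs_sqrt / max(n_obs_sqrt)

round(rel_width, 3)
#>     A     B     C     D     E     F
#> 1.000 0.816 0.577 0.577 0.816 0.577
```

The values come out in alphabetical category order, which is the same order as the rows of boxplot_stats, so each width ends up on the matching box.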
Footnote
If for some reason you can't use dplyr, then the equivalent using your aggregate output would be:
ggplot(boxplot_stats, aes(category)) +
  geom_boxplot(aes(ymin = residuals[, 1], lower = residuals[, 2],
                   middle = residuals[, 3], upper = residuals[, 4],
                   ymax = residuals[, 5]),
               stat = "identity", color = "black", fill = "lightblue",
               width = boxplot_stats$residuals[, 6] /
                 max(boxplot_stats$residuals[, 6]),
               position = position_dodge()) +
  labs(title = "boxplot", x = "Category", y = "Residuals") +
  theme_bw()
Created on 2023-08-09 with reprex v2.0.2