
How to vary boxplot width when making custom quantiles?


I want the boxplot whiskers to extend to the median of the data +/- 1.96 * standard deviation (roughly the central 95% of the data if it is approximately normally distributed). I did this by computing the boxplot statistics with aggregate and passing them in as the minimum, lower quartile, median, and so on. How can I make the box width vary so that it is proportional to the square root of the number of observations (as ggplot does with varwidth = TRUE)? Everything I have tried so far (setting weight, width) changes the width of all the categories equally. Thank you.

rm(list = ls())
library(ggplot2)

set.seed(1)

residuals <- runif(n=1000, min=-3, max=3)
category <- c('A','A','A','B','B','C','D','E','E','F')
df1 <- data.frame(category,residuals)


boxplot_stats <- aggregate(residuals ~ category, df1, function(x) {
  median_val = median(x)
  z_score = 1.96
  min_quantile = median_val - z_score * sd(x)
  lower_quantile = quantile(x, c(0.25))
  upper_quantile = quantile(x, c(0.75))
  max_quantile = median_val + z_score * sd(x)
  n_obs_sqrt = sqrt(length(x))
  c(min_quantile, lower_quantile, median_val, upper_quantile, max_quantile, n_obs_sqrt)
})

custom_boxplot <- ggplot(boxplot_stats, aes(x=category))+
  geom_boxplot(aes(ymin = residuals[, 1], lower = residuals[, 2], middle = residuals[, 3], upper = residuals[, 4], ymax = residuals[, 5]), stat = "identity", color = "black",fill="lightblue") +
  labs(title="boxplot",x="Category",y="Residuals") + 
  theme_bw()
print(custom_boxplot)

Solution

    Although it is not strictly necessary (see the footnote), the more idiomatic approach is to use dplyr's summarise instead of aggregate, so that each statistic gets its own named column rather than living in a matrix column that has to be indexed inside ggplot:

    library(tidyverse)
    
    boxplot_stats <- df1 %>%
      group_by(category) %>%
      summarise(median_val     = median(residuals),
                min_quantile   = median_val - 1.96 * sd(residuals),
                lower_quantile = quantile(residuals, c(0.25)),
                upper_quantile = quantile(residuals, c(0.75)),
                max_quantile   = median_val + 1.96 * sd(residuals),
                n_obs_sqrt     = sqrt(n()))
    
    boxplot_stats
    #> # A tibble: 6 x 7
    #>   category median_val min_quantile lower_quantile upper_quantile max_quantile
    #>   <chr>         <dbl>        <dbl>          <dbl>          <dbl>        <dbl>
    #> 1 A            0.0247        -3.44         -1.34           1.66          3.49
    #> 2 B           -0.166         -3.43         -1.43           1.38          3.10
    #> 3 C           -0.445         -3.59         -1.67           0.905         2.70
    #> 4 D           -0.443         -3.70         -1.84           0.971         2.81
    #> 5 E           -0.0109        -3.44         -1.35           1.53          3.41
    #> 6 F            0.739         -2.82         -0.985          1.84          4.30
    #> # i 1 more variable: n_obs_sqrt <dbl>
    

    Because width is set outside aes() here rather than mapped as an aesthetic, you need to supply the actual widths you want drawn, one per box. None of them should be larger than 1 (otherwise neighbouring boxes will overlap), so a suitable expression is boxplot_stats$n_obs_sqrt/max(boxplot_stats$n_obs_sqrt), which gives the largest group a width of 1 and scales the others down in proportion. You also need to specify position = position_dodge() to get the boxes to line up correctly:

    ggplot(boxplot_stats, aes(x = category)) +
      geom_boxplot(aes(ymin   = min_quantile, lower = lower_quantile,
                       middle = median_val,   upper = upper_quantile, 
                       ymax   = max_quantile, weight = n_obs_sqrt), 
                   width = boxplot_stats$n_obs_sqrt/max(boxplot_stats$n_obs_sqrt),
                   stat = "identity",
                   color = "black", fill = "lightblue", 
                   position = position_dodge()) +
      labs(title = "boxplot", x = "Category", y = "Residuals") + 
      theme_bw()
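
    As a quick sanity check of the width expression, here is a minimal base R sketch that recomputes the relative widths from the group sizes alone. It assumes the same recycled category pattern as in the question; the names n_obs and rel_width are just illustrative and do not appear in the code above.

    ```r
    # Rebuild the category column the way data.frame() recycles it in the question
    category <- rep(c('A','A','A','B','B','C','D','E','E','F'), length.out = 1000)

    # Group sizes, their square roots, and widths rescaled so the largest is 1
    n_obs      <- table(category)
    n_obs_sqrt <- sqrt(as.numeric(n_obs))
    rel_width  <- n_obs_sqrt / max(n_obs_sqrt)
    names(rel_width) <- names(n_obs)

    round(rel_width, 3)
    #>     A     B     C     D     E     F 
    #> 1.000 0.816 0.577 0.577 0.816 0.577 
    ```

    The largest group (A, with 300 observations) gets the full width of 1, and every other box is narrower in proportion to the square root of its group size.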
    



    Footnote

    If for some reason you can't use dplyr then the equivalent using your aggregate output would be:

    ggplot(boxplot_stats, aes(category)) +
      geom_boxplot(aes(ymin = residuals[, 1],   lower = residuals[, 2], 
                       middle = residuals[, 3], upper = residuals[, 4], 
                       ymax = residuals[, 5]), 
                   stat = "identity", color = "black", fill = "lightblue", 
                   width = boxplot_stats$residuals[,6]/
                             max(boxplot_stats$residuals[,6]),
                   position = position_dodge()) +
      labs(title = "boxplot", x = "Category", y = "Residuals") + 
      theme_bw()
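
    If the matrix indexing gets awkward, one possible middle ground is to stay in base R but unpack the matrix column into ordinary named columns. The sketch below is only one way to do this; the names flat_stats and rel_width, and the names given to the statistics, are my own choices rather than anything from the question.

    ```r
    set.seed(1)
    df1 <- data.frame(
      category  = rep(c('A','A','A','B','B','C','D','E','E','F'), length.out = 1000),
      residuals = runif(n = 1000, min = -3, max = 3)
    )

    # Return a named vector so the resulting matrix column has column names
    boxplot_stats <- aggregate(residuals ~ category, df1, function(x) {
      c(ymin       = median(x) - 1.96 * sd(x),
        lower      = unname(quantile(x, 0.25)),
        middle     = median(x),
        upper      = unname(quantile(x, 0.75)),
        ymax       = median(x) + 1.96 * sd(x),
        n_obs_sqrt = sqrt(length(x)))
    })

    # boxplot_stats$residuals is a matrix; turn it into regular columns
    flat_stats <- data.frame(category = boxplot_stats$category,
                             as.data.frame(boxplot_stats$residuals))

    # Relative widths, with the largest group scaled to 1
    flat_stats$rel_width <- flat_stats$n_obs_sqrt / max(flat_stats$n_obs_sqrt)

    names(flat_stats)
    ```

    With a data frame like this, the aesthetics can be mapped by name (ymin = ymin, lower = lower, and so on) and flat_stats$rel_width can be passed as width, in the same way as in the summarise version.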
    


    Created on 2023-08-09 with reprex v2.0.2