Tags: r, visualization, boxplot, quartile

R: plot Q1, Q2, Q3 & mean by category


I have a very large dataset and want to draw a boxplot-style chart showing Q1, Q2, and Q3 by category. I want the standard interquartile-range box with a thicker line marking the median, but without the whiskers and outliers. I would also like to add the mean for each category.

Because the data is so large, it would be easier to compute these statistics first and then plot them with stat = "identity". I found the code below, which computes the statistics and then plots them. However, it stops working when I delete ymin and ymax. I would like similar code that (i) omits the max and min, (ii) adds the mean as a dot, and (iii) computes and plots the statistics by category.

library(ggplot2)

y <- rnorm(100)
df <- data.frame(
  x = 1,
  y0 = min(y),
  y25 = quantile(y, 0.25),
  y50 = median(y),
  y75 = quantile(y, 0.75),
  y100 = max(y)
)
ggplot(df, aes(x)) +
  geom_boxplot(
    aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100),
    stat = "identity"
  )

Solution

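  • Assuming the category is x and you have already calculated the statistics for each category (which I simulate in the example), you can keep the ymin and ymax aesthetics but set ymin to Q1 and ymax to Q3. The whiskers then have zero length and are effectively hidden: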

    library(ggplot2)
    set.seed(1234)
    # Simulated data for two categories
    y1 <- rnorm(100)
    y2 <- rnorm(100)

    # One row of precomputed statistics per category
    df <- data.frame(
      x = as.factor(1:2),
      y0 = c(min(y1), min(y2)),
      y25 = c(quantile(y1, 0.25), quantile(y2, 0.25)),
      y50 = c(quantile(y1, 0.5), quantile(y2, 0.5)),
      y75 = c(quantile(y1, 0.75), quantile(y2, 0.75)),
      y100 = c(max(y1), max(y2)),
      mean = c(mean(y1), mean(y2))
    )
    # Collapse the whiskers: with ymin = Q1 and ymax = Q3 they have zero length
    df$y100 <- df$y75
    df$y0 <- df$y25

    ggplot(df, aes(x)) +
      geom_boxplot(
        aes(group = x, ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100),
        stat = "identity"
      ) +
      geom_point(aes(group = x, y = mean))
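    The example above builds the statistics by hand for two simulated vectors. If the raw data is in long format, the per-category statistics can be computed in one pass and plotted the same way. The following is only a sketch under assumptions: the data frame raw and its columns category and value are hypothetical names, and the simulated values stand in for the real data.

```r
library(ggplot2)

set.seed(1234)
# Hypothetical raw data in long format: one row per observation
raw <- data.frame(
  category = rep(c("a", "b", "c"), each = 100),
  value = rnorm(300)
)

# Split the values by category, then compute one row of statistics per group
groups <- split(raw$value, raw$category)
stats <- data.frame(
  category = names(groups),
  q1 = vapply(groups, quantile, numeric(1), probs = 0.25, names = FALSE),
  q2 = vapply(groups, median, numeric(1)),
  q3 = vapply(groups, quantile, numeric(1), probs = 0.75, names = FALSE),
  mean = vapply(groups, mean, numeric(1))
)

# ymin = Q1 and ymax = Q3 give zero-length whiskers; the mean is drawn as a dot
p <- ggplot(stats, aes(x = category)) +
  geom_boxplot(
    aes(ymin = q1, lower = q1, middle = q2, upper = q3, ymax = q3),
    stat = "identity"
  ) +
  geom_point(aes(y = mean))
print(p)
```

    Only the small stats table reaches ggplot, so the plotting cost does not grow with the size of the raw data. Another option that may be worth trying is geom_crossbar(aes(y = q2, ymin = q1, ymax = q3)), which draws a box with a middle line and has no whiskers to hide.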