I've got a boxplot like:
It's really hard to tell which group tends to have higher values.
How do I add lines in the chart that connect Q1, Q2, Q3 for each group respectively?
You can use stat_summary to apply an aggregation function to each group of y values and pass the result to a geom. There's a caveat: boxplots are already grouped (each box is a group), so to get a geom to cross boxplots, you'll have to set the grouping differently between geom_boxplot and stat_summary. Think of the group aesthetic as defining what should be connected or separate.
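To make the grouping point concrete, here is a minimal sketch (not the full solution) that connects the medians of ungrouped boxes. It assumes ggplot2 3.3.0 or later, where the summary argument is named fun; the object name p is just for illustration.

```r
library(ggplot2)

# aes(group = 1) puts every observation into a single group for this layer only,
# so the line geom connects the per-class medians across the boxes.
# The boxplot layer keeps its default grouping (one box per class).
p <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  stat_summary(aes(group = 1), geom = 'line', fun = median)  # fun.y in ggplot2 < 3.3.0

p
```

Without the group override, the line layer would inherit one group per class and have nothing to connect.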
See ?stat_summary for the full details of how you can specify the aggregation function, but the tl;dr is that it can take either independent functions for y, ymin, and ymax (fun.y, fun.ymin, and fun.ymax, which were renamed fun, fun.min, and fun.max in ggplot2 3.3.0), or one called fun.data that returns a data frame with columns named y, ymin, and ymax. Args can also be passed through fun.args if you don't want to write anonymous functions. If the geom doesn't need all three aesthetics, you can just specify one (e.g. lines just need y).
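As a rough illustration of the fun.data shape described above, here is a small base R sketch. The helper name quartile_summary and its lower/upper arguments are made up for this example and are not part of ggplot2.

```r
# Hypothetical helper (not a ggplot2 function): summarise a numeric vector
# as its median plus lower and upper quantiles, in the column layout
# that fun.data expects (y, ymin, ymax).
quartile_summary <- function(x, lower = 0.25, upper = 0.75) {
  data.frame(
    y    = median(x),
    ymin = unname(quantile(x, lower)),
    ymax = unname(quantile(x, upper))
  )
}

quartile_summary(c(1, 2, 3, 4, 5))
# y = 3, ymin = 2, ymax = 4
```

A function like this could then be handed to stat_summary as fun.data = quartile_summary, with something like fun.args = list(lower = 0.1, upper = 0.9) if you wanted different cut points without writing an anonymous function.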
So to plot lines at the first, second, and third quartiles, adding some dodging so everything lines up:

    library(ggplot2)
    data('mpg', package = 'ggplot2')
    mpg$cyl <- factor(mpg$cyl)

    # note grouping here is set to what stat_summary needs (so we don't have to override so many times)
    # (fun is the current name of the argument; it was fun.y before ggplot2 3.3.0)
    ggplot(mpg, aes(class, hwy, color = cyl, group = cyl)) +
        geom_boxplot(aes(group = NULL)) +  # override grouping here back to the default
        stat_summary(geom = 'line', fun = median, position = position_dodge(0.75)) +
        stat_summary(geom = 'line', fun = quantile, fun.args = list(probs = 0.25), position = position_dodge(0.75)) +
        stat_summary(geom = 'line', fun = quantile, fun.args = list(probs = 0.75), position = position_dodge(0.75))
This plot is pretty busy, though. Another parallel coordinates/boxplot cross might be to use geom_ribbon (which takes ymin and ymax) for the first and third quartiles:
    # fun is the current name of the argument; it was fun.y before ggplot2 3.3.0
    ggplot(mpg, aes(class, hwy, color = cyl, fill = cyl, group = cyl)) +
        geom_boxplot(aes(group = NULL, fill = NULL)) +  # boxes keep default grouping and stay unfilled
        stat_summary(geom = 'line', fun = median, position = position_dodge(0.75)) +
        stat_summary(geom = 'ribbon', alpha = 0.3, position = position_dodge(0.75),
                     fun.data = function(x) {
                         # ribbon only needs ymin and ymax, so y is omitted
                         data.frame(ymin = quantile(x, 0.25),
                                    ymax = quantile(x, 0.75))
                     })