Tags: r, ggplot2, line, boxplot

Add lines for upper/lower bound and median to a ggplot boxplot


I've got a boxplot like this:

[image: the original boxplot]

It's really hard to tell from it which group tends to have higher values.

How do I add lines in the chart that connect Q1, Q2, Q3 for each group respectively?


Solution

  • You can use stat_summary to apply an aggregation function to each group of y values and pass the result to a geom. There's a caveat: boxplots are already grouped (each box is a group), so to get a geom to cross boxplots, you'll have to set the grouping differently between geom_boxplot and stat_summary. Think of the group aesthetic as defining what should be connected or separate.
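    To see what the group aesthetic changes, one option is to inspect the data each layer computes. The sketch below assumes ggplot2 3.3.0 or later (where the argument is named fun) and uses layer_data to count groups; the object names p, boxes, and lines are only for illustration.

    ```r
    library(ggplot2)

    data('mpg', package = 'ggplot2')
    mpg$cyl <- factor(mpg$cyl)

    p <- ggplot(mpg, aes(class, hwy, color = cyl, group = cyl)) +
        geom_boxplot(aes(group = NULL)) +    # default grouping: one group per class/cyl box
        stat_summary(geom = 'line', fun = median, position = position_dodge(0.75))

    boxes <- layer_data(p, 1)    # computed data for the boxplot layer
    lines <- layer_data(p, 2)    # computed data for the summary-line layer

    length(unique(boxes$group))  # one group per box
    length(unique(lines$group))  # one group per cyl level, so each line spans classes
    ```

    The line layer has far fewer groups than the boxplot layer, which is what lets each line run across the boxes.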

    See ?stat_summary for the full details of how you can specify the aggregation function, but the tl;dr is that it can either take independent functions for y, ymin, and ymax (the arguments fun, fun.min, and fun.max; they were named fun.y, fun.ymin, and fun.ymax before ggplot2 3.3.0), or one called fun.data that returns a data frame with columns named y, ymin, and ymax. Arguments can also be passed through fun.args if you don't want to write anonymous functions. If the geom doesn't need all three aesthetics, you can specify just one (e.g. lines only need y).
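    As a minimal sketch of the fun.data form: the helper below (quartile_summary is a hypothetical name, not part of ggplot2) returns the three quartiles as a one-row data frame with the column names stat_summary looks for.

    ```r
    # Hypothetical helper: summarise a numeric vector by its quartiles.
    quartile_summary <- function(x) {
        q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
        data.frame(ymin = q[1], y = q[2], ymax = q[3])
    }

    quartile_summary(c(1, 2, 3, 4, 5))   # ymin = 2, y = 3, ymax = 4
    ```

    It could then be passed as stat_summary(fun.data = quartile_summary, geom = 'pointrange'), a geom that uses all three columns.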

    So to plot lines at the first, second, and third quartiles, adding some dodging so everything lines up with the boxes:

    library(ggplot2)
    
    data('mpg', package = 'ggplot2')
    mpg$cyl <- factor(mpg$cyl)
    
    # note: grouping here is set to what stat_summary needs (so we don't have to override it so many times)
    # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
    ggplot(mpg, aes(class, hwy, color = cyl, group = cyl)) +
        geom_boxplot(aes(group = NULL)) +    # override grouping here back to the default
        stat_summary(geom = 'line', fun = median, position = position_dodge(0.75)) +
        stat_summary(geom = 'line', fun = quantile, fun.args = list(probs = 0.25), position = position_dodge(0.75)) +
        stat_summary(geom = 'line', fun = quantile, fun.args = list(probs = 0.75), position = position_dodge(0.75))
    

    [Figure: boxplots with lines]

    This plot is pretty busy, though. Another way to cross parallel coordinates with boxplots is to use geom_ribbon (which takes ymin and ymax) for the first and third quartiles:

    ggplot(mpg, aes(class, hwy, color = cyl, fill = cyl, group = cyl)) + 
        geom_boxplot(aes(group = NULL, fill = NULL)) + 
        stat_summary(geom = 'line', fun = median, position = position_dodge(0.75)) +  # fun replaces the deprecated fun.y
        stat_summary(geom = 'ribbon', alpha = 0.3, position = position_dodge(0.75),
                     fun.data = function(x) {
                         data.frame(ymin = quantile(x, 0.25), 
                                    ymax = quantile(x, 0.75))
                     })
    

    [Figure: boxplots with line and ribbon]
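
    To check the values the lines and ribbon are drawn through, the same quartiles can be computed outside the plot. This is only a sketch using base R's aggregate, not part of the plotting code above; the object name quartiles is just for illustration.

    ```r
    library(ggplot2)   # only needed here for the mpg data set

    data('mpg', package = 'ggplot2')
    mpg$cyl <- factor(mpg$cyl)

    # Q1, median, and Q3 of hwy for every class/cyl combination present in the data.
    # Because FUN returns three values, the hwy column of the result is a 3-column matrix.
    quartiles <- aggregate(hwy ~ class + cyl, data = mpg,
                           FUN = quantile, probs = c(0.25, 0.5, 0.75))
    head(quartiles)
    ```

    Each row corresponds to one box, and the three hwy values are the heights where the lower line, the median line, and the upper line (or the ribbon edges) meet that box.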