Tags: r, ggplot2, boxplot

Reorder a grouped boxplot by median of one group


I have a data frame consisting of 3 columns: Site, Program, Result. Here is a minimal reproducible dataset:

> TP <- data.frame(Site = as.factor(c("Coal", "Coal", "Coal", "Coal", "STP", "STP", "STP", "STP")), 
                 Program = as.factor(c("D", "D", "H", "H", "D", "D", "H", "H")),
                 Result = c(0.65, 0.58, 0.15, 0.10, 0.55, 0.53, 0.48, 0.49))
> TP
   Site      Program                Result
   <fct>     <chr>                  <dbl>
 1 Coal      D                      0.65
 2 Coal      D                      0.58
 3 Coal      H                      0.15
 4 Coal      H                      0.10
 5 STP       D                      0.55
 6 STP       D                      0.53
 7 STP       H                      0.48
 8 STP       H                      0.49

The real data has 70000 rows, covering 50 sites and two programs.
I have created a geom_boxplot where the x variable is 'Result' and the y variable is 'Site'. For each site, I have two boxplots showing the data from the two programs (D and H). The y-axis is currently sorted by the overall median of each site, regardless of the program.

> TP$Site <- reorder(TP$Site, TP$Result, FUN = median)
> ggplot(TP, aes(x = Result, y = Site)) + geom_boxplot(aes(fill = as.factor(Program)), outliers = FALSE)

I am trying to alter the graph so that the y-axis is ordered by each site's median for Program D, with the highest median at the top. I would still like the boxplot for Program H to sit immediately below the boxplot for Program D at each site; I just want the sites themselves ordered by Program D. Some sites only have data from Program D, and ideally they would be ordered appropriately on the y-axis too, even though they have no data for Program H.

I have seen many solutions on Stack Overflow using order, reorder or arrange (dplyr). I have tried several of these suggestions with no luck.

I have successfully used 'reorder' (stats) in my existing code to order the data frame by the median of the results, but I cannot replicate that result for multiple inputs and orders. I then attempted to use 'order' (base) to overcome this, but I cannot come up with a solution for multiple orders. I then also attempted to use dplyr solutions, using a combination of group_by, mutate and arrange. I cannot get this to work either.

In the supplied repro dataset, 'STP' has a higher overall median than 'Coal'. But 'Coal' has a higher median for Program D, so I would want 'Coal' to be at the top of the boxplot.
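
To make the difference concrete, both sets of medians can be computed directly from the repro dataset. This is only a small base R check using tapply; the names overall_median, d_rows and d_median are introduced here for illustration and are not part of my plotting code.

```r
# Repro dataset from above
TP <- data.frame(
  Site = factor(c("Coal", "Coal", "Coal", "Coal", "STP", "STP", "STP", "STP")),
  Program = factor(c("D", "D", "H", "H", "D", "D", "H", "H")),
  Result = c(0.65, 0.58, 0.15, 0.10, 0.55, 0.53, 0.48, 0.49)
)

# Median of all results per site, regardless of program
overall_median <- tapply(TP$Result, TP$Site, median)

# Median per site using the Program D rows only
d_rows <- TP$Program == "D"
d_median <- tapply(TP$Result[d_rows], TP$Site[d_rows], median)

overall_median  # Coal 0.365, STP 0.51
d_median        # Coal 0.615, STP 0.54
```

So the overall median puts STP first, while the Program D median puts Coal first.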

Any help is much appreciated. Let me know if I can provide more info.


Solution

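  • One option would be to use an ifelse to reorder the Site using only the values for D. The ifelse replaces every Program H result with NA, and na.rm = TRUE makes median ignore those NAs, so each site is ranked by its Program D median alone: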

    library(ggplot2)
    
    ggplot(
      TP,
      aes(
        x = Result,
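        # Rank sites by the median of their Program D results only
        y = reorder(
          Site,
          ifelse(Program == "D", Result, NA),  # hide Program H values
          FUN = median,
          na.rm = TRUE                         # ignore the NAs created above
        )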
      )
    ) +
      geom_boxplot(
        aes(fill = Program),
        outliers = FALSE
      )
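
If it helps to inspect the ordering before plotting, the same idea can be sketched by setting the factor levels up front. This is a minimal base R sketch, not the only way to do it; d_rows, d_median and site_levels are names made up for this example, and putting sites without any Program D data at the bottom (via na.last = FALSE) is an assumption about what is wanted.

```r
# Repro dataset from the question
TP <- data.frame(
  Site = factor(c("Coal", "Coal", "Coal", "Coal", "STP", "STP", "STP", "STP")),
  Program = factor(c("D", "D", "H", "H", "D", "D", "H", "H")),
  Result = c(0.65, 0.58, 0.15, 0.10, 0.55, 0.53, 0.48, 0.49)
)

# Median of the Program D results per site; a site with no D rows gets NA
d_rows <- TP$Program == "D"
d_median <- tapply(TP$Result[d_rows], TP$Site[d_rows], median)

# Ascending order of D medians, with NA sites first.
# ggplot2 draws the first level at the bottom of a discrete y-axis,
# so the site with the highest D median ends up at the top.
site_levels <- names(d_median)[order(d_median, na.last = FALSE)]

# Re-level Site; rows keep their values, only the level order changes
TP$Site <- factor(TP$Site, levels = site_levels)

levels(TP$Site)  # "STP" "Coal"
```

With the levels set this way, the plot can use a plain y = Site inside aes(), which also keeps the axis title as "Site" rather than the full reorder expression.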