Tags: r, ggplot2, dplyr, boxplot, median

ggplot2 Boxplot displays different median than calculated


I'm plotting a simple boxplot of two groups' weight by year, based on a large dataset (2150000 cases). All groups have the same median except for the last group in the last year, but on the boxplot that group's median is drawn the same as every other.

# boxplot
library(ggplot2)

ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
  geom_boxplot(outlier.shape = NA)+
  ylim(0,850)


# median by group
library(dplyr)

pivot <- dataset %>%
  select(SUM_MME_mg,GenderPerson,Year )%>%
  group_by(Year, GenderPerson) %>%
  summarise(MedianValues = median(SUM_MME_mg,na.rm=TRUE))

I can't figure out what I am doing wrong, or which is more accurate: the boxplot's calculation or the median() function. R returns no error or warning.

 #my data:
> dput(head(dataset[,c(1,7,10)]))
structure(list(GenderPerson = c(2L, 1L, 2L, 2L, 2L, 2L), Year = c("2015", 
"2014", "2013", "2012", "2011", "2015"), SUM_MME_mg = c(416.16, 
131.76, 790.56, 878.4, 878.4, 878.4)), row.names = c(NA, 6L), class = "data.frame")

Solution

  • This behavior comes from how ylim() works. ylim() is a convenience wrapper for scale_y_continuous(limits = ...). As the documentation for the continuous scale functions explains, setting limits on a scale does not just zoom in on an area: values outside the limits are replaced with NA and then dropped. This happens before the stat computes the boxplot statistics, which is why the plotted median differs when you use ylim(). Your calculation outside ggplot() uses the entire dataset, whereas with ylim() the out-of-range data points are removed before the median is calculated.

    Luckily, there's a simple fix for that, which is to use coord_cartesian(ylim=...) in place of ylim(), since coord_cartesian() will simply zoom in on the data without removing datapoints. See the difference here:

    ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
      geom_boxplot(outlier.shape = NA) + ylim(0,850)
    


    ggplot(dataset, aes(x=Year, y=SUM_MME_mg, fill=GenderPerson)) + 
      geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim=c(0,850))
    


    A hint for this behavior is that the first code chunk, which uses ylim(), should also produce a warning message:

    Warning message:
    Removed 3 rows containing non-finite values (stat_boxplot). 
    

    The second chunk, which uses coord_cartesian(ylim = ...), does not.
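
To check the difference numerically rather than by eye, here is a minimal sketch on made-up data. The data frame `toy` and its columns `grp` and `val` are hypothetical and not from the question. It uses `layer_data()` from ggplot2 to read back the median each plot actually draws, and `scales::censor()`, which I believe is the default out-of-bounds handling for continuous scales, to show what happens to values outside the limits.

```r
library(ggplot2)

# Hypothetical toy data: one group, most values above the upper limit of 850
toy <- data.frame(
  grp = "a",
  val = c(100, 200, 300, 900, 900, 900, 900)
)

# Median of the full data, as median() or dplyr::summarise() would report it
median(toy$val)                    # 900

# What a scale limit does to out-of-range values: they become NA
scales::censor(toy$val, c(0, 850)) # 100 200 300 NA NA NA NA

p_ylim <- ggplot(toy, aes(x = grp, y = val)) +
  geom_boxplot() +
  ylim(0, 850)

p_coord <- ggplot(toy, aes(x = grp, y = val)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 850))

# "middle" is the median computed by stat_boxplot for each plot
layer_data(p_ylim)$middle   # 200, with a warning about removed rows
layer_data(p_coord)$middle  # 900, no warning
```

With ylim() the four values of 900 are dropped before the statistics are computed, so the box is built from 100, 200 and 300 only. With coord_cartesian() all seven values are used and only the visible window changes.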