
ggplot2: mean and median in geom_violin


I drew a violin plot and overlaid the group means and medians as follows:

library(ggplot2)

test <- read.csv2("http://www.reduts.net/test.csv", sep=",")

ggplot(data = test, aes(y = var, x = as.factor(grp))) +
  geom_violin() +
  # `fun` replaces `fun.y`, which is deprecated as of ggplot2 3.3.0
  stat_summary(fun = mean, geom = "point", shape = 23, size = 2) +
  stat_summary(fun = median, geom = "point", size = 2, color = "red") +
  xlab("Group") +
  ylab("EUR") +
  scale_y_continuous(limits = c(0,1000), breaks = seq(0,1000,200))

# ggsave() is called on its own rather than added with `+`; it saves the last plot
ggsave("image.jpg", dpi = 300, units = 'cm', height = 10, width = 22)

library(psych)
describe(test$var)

Now, my problem is that all the group-means displayed in the image are far lower than the mean I get when using psych::describe() over all groups.

Is it possible that the means and medians computed for each group exclude that group's outliers (that is, use only the values within the whiskers)? If so, how can I plot the "real" means/medians over all data points?


Solution

  • Using scale_y_continuous(limits=) filters the underlying data: values outside the limits are dropped before any statistics are computed, so the mean/median in stat_summary are calculated from the filtered data only.

    To simply zoom in without changing the underlying data, use coord_cartesian

    e.g.

    + coord_cartesian(ylim=c(0, 1000))
    

    Here is a reproducible example:

    library(ggplot2)
    p <- ggplot(iris, aes(x=Species, y=Sepal.Length)) + geom_point() +
        # `fun` replaces `fun.y`, which is deprecated as of ggplot2 3.3.0
        stat_summary(fun=mean, geom='point', size=2, col='red')
    p
    mean(subset(iris, Species == 'setosa')$Sepal.Length)
    # 5.006
    

    Note that the average Sepal Length for setosa is about 5. Now let's limit the y axis.

    p + scale_y_continuous(limits=c(5, 8), minor_breaks=seq(5, 8, by=0.1))
    # Warning messages:
    # 1: Removed 22 rows containing non-finite values (stat_summary).
    # 2: Removed 22 rows containing missing values (geom_point).
    

    Note the warning messages, and see that in the resulting plot the average Sepal Length for setosa is now a bit more than 5.2. To confirm that scale_y_continuous is indeed filtering the data before stat_summary is calculated, compute the mean of only those setosa values that fall within the limits:

    mean(subset(iris, Species == 'setosa' & Sepal.Length >= 5)$Sepal.Length)
    # 5.23
    

    whereas if I just do

    p + coord_cartesian(ylim=c(5, 8))
    

    the means are the same as for the original data. (You can still use scale_y_continuous for the breaks; just don't set the limits there.)
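
To check the difference numerically rather than by eye, one option is to pull the values that stat_summary actually computed out of each plot. The following is a minimal sketch, assuming ggplot2 >= 3.3.0 (for the `fun` argument) and using `layer_data()` to read the second layer's computed data; the names `p_scale`, `p_zoom`, `mean_scale` and `mean_zoom` are made up for this illustration.

```r
library(ggplot2)

# Base plot from the example above.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point() +
  stat_summary(fun = mean, geom = "point", size = 2, colour = "red")

# Scale limits drop out-of-range rows before the statistic is computed.
p_scale <- p + scale_y_continuous(limits = c(5, 8))

# coord_cartesian() only zooms; breaks can still be set on the scale.
p_zoom <- p +
  scale_y_continuous(breaks = seq(5, 8, by = 0.5)) +
  coord_cartesian(ylim = c(5, 8))

# layer_data(plot, 2) returns the data computed for the second layer,
# here the stat_summary() points. Setosa is the first level, so x == 1.
d_scale <- suppressWarnings(layer_data(p_scale, 2))
d_zoom  <- layer_data(p_zoom, 2)

mean_scale <- d_scale$y[d_scale$x == 1]
mean_zoom  <- d_zoom$y[d_zoom$x == 1]

mean_scale  # 5.23, the mean of setosa values >= 5 only
mean_zoom   # 5.006, the mean of all setosa values
```

The same comparison should carry over to the violin plot in the question: drop `limits` from scale_y_continuous, keep `breaks`, and add coord_cartesian(ylim = c(0, 1000)).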