Search code examples
rggplot2bar-chartmeanlogarithm

The means from stats_summary on my bar plot with a y-axis in log scale are incorrect


I am wanting to make a bar plot, with error bars representing standard deviations and an overlay of a scatter plot to show individual data points. The data is microbial count data and there are 2 conditions, for which there are three time points. For 5 out of the 6 conditions, some of the individual data points are zero, which I inputted as 1 in order to have a data point showing (in the grand scheme of things, the 0 to 1 transformation does not change much given how large the other numbers are). However, when plotted, the means for most of the bars (some "Steel" and all the "Copper" bars) are not correct. I assume it's due to the transformation of the y-axis in log10 scale but I can't figure out how to fix it! I am quite new to R and even more to ggplot2 so I'm at a loss here. Any help would be greatly appreciated.

Below is the code I have tried, with a subset of the data.

Data:

Material<-c("Copper T0", "Copper T0","Copper T0","Copper T0","Copper T0","Copper T0","Copper T0","Copper T0","Copper T0",
             "Copper T+1h","Copper T+1h","Copper T+1h","Copper T+1h","Copper T+1h","Copper T+1h","Copper T+1h","Copper T+1h","Copper T+1h",
             "Copper T+2h","Copper T+2h","Copper T+2h","Copper T+2h","Copper T+2h","Copper T+2h","Copper T+2h","Copper T+2h","Copper T+2h",
             "Steel T0", "Steel T0","Steel T0","Steel T0","Steel T0","Steel T0","Steel T0","Steel T0","Steel T0",
             "Steel T+1h", "Steel T+1h","Steel T+1h","Steel T+1h","Steel T+1h","Steel T+1h","Steel T+1h","Steel T+1h","Steel T+1h",
             "Steel T+2h","Steel T+2h","Steel T+2h","Steel T+2h","Steel T+2h","Steel T+2h","Steel T+2h","Steel T+2h","Steel T+2h")
  CFU<-c(1,1,1,1,1,1,1,1,12500,
             1,1,1,1,1,1,1,1,12500,
             1,1,1,1,1,1,1,1,12500,
             262500,137500,437500,112500,37500,62500,250000,225000,50000,
             50000,112500,37500,62500,75000,1,12500,87500,1,
             1,1,25000,12500,1,1,137500,150000,112500)
data3<-data.frame(Material,CFU)
print(data3)

Code:

library(readxl)
library(ggplot2)
library(Hmisc)
library(scales)
order<-c("Steel T0", "Steel T+1h", "Steel T+2h", "Copper T0", "Copper T+1h", "Copper T+2h")
truncate_error_bars <- function(x) {
  data <- mean_sdl(x, mult = 1)
  data$ymin <- pmax(data$ymin, 0)
  return(data)
}
ggplot(data3, aes(x=factor(Material, order), y=`CFU`, fill=factor(Material))) +
  stat_summary(fun=mean, geom="bar", show.legend=FALSE, color="black") +
  stat_summary(fun.data=truncate_error_bars,
               geom="errorbar", width=.2) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(),axis.line=element_line(colour = "darkgrey"))+
  scale_y_continuous(trans= 'log10', expand = c(0, 0.0001),
                     breaks = trans_breaks("log10",n=8, function(x) 10^x))+
  geom_point(position = position_jitter(width = 0.2), color = "black", show.legend=FALSE)+
  expand_limits(x = 0, y = c(0,1e+07))+
  theme(axis.title.x=element_blank(), axis.ticks.x=element_blank())+
  scale_fill_manual(values=c("sienna3", 
                             "sienna3", 
                             "sienna3", 
                             "lightslategrey",
                             "lightslategrey",
                             "lightslategrey")) +
  ylab(bquote(PFU.cm^-2))

The graph I got from this is the following:

Bar chart

The means that are showing on the graph are not the correct ones for some of the "Steel" data and for the "Copper" data. I would like the bars to reach the correct means if possible.

Thank you so much for you help in advance!


Solution

  • One way to manage this is to compute the summary statistics prior to creating the plot.

    library(ggplot2)
    library(scales)
    library(Hmisc)
    library(dplyr)
    
    data_stats <- data3 |>
      reframe(mean = mean(CFU),
              upper = smean.sdl(CFU, mult = 1)[[3]],
              lower = pmax(smean.sdl(CFU, mult = 1)[[2]], 1),
                .by = Material)
    

    Now when you apply a transformation on the y-axis, the mean isn't unintentionally calculated on transformed values.

    data3 |>
      ggplot(aes(x = factor(Material, order),
                 fill = Material)) +
      geom_col(data = data_stats, aes(y = mean)) +
      geom_errorbar(data = data_stats, aes(ymin = lower, ymax = upper),
                    width = 0.2) +
      geom_point(aes(y = CFU),
                 position = position_jitter(width = 0.2),
                 color = "black") +
      scale_y_continuous(trans = "log10", breaks = breaks_log(n = 8, base = 10)) +
      expand_limits(x = 0, y = c(1,1e+07)) +
      scale_fill_manual(values=c("sienna3", 
                                 "sienna3", 
                                 "sienna3", 
                                 "lightslategrey",
                                 "lightslategrey",
                                 "lightslategrey")) +
      ylab(bquote(PFU.cm^-2)) +
      theme(axis.title.x=element_blank(),
            axis.ticks.x=element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.background = element_blank(),
            axis.line=element_line(colour = "darkgrey"),
            legend.position = "none")
    

    Created on 2024-06-03 with reprex v2.1.0