Search code examples
rggplot2plotmedianconfidence-interval

Calculation of confidence intervals of the median in ggplot


I have a dataset consisting of a virtual population with multiple observations per ID. This population is categorized into two groups. I would like to plot the observations vs. time for each group. I could successfully calculate the mean and the confidence interval of the mean using stat_summary. However, in a second plot, I would like to do the same for the median. stat_summary(fun.y=median) does work but I could not find a solution to calculate the proper confidence intervals for the median. Do I have to calculate them separately?

Here's the relevant part of the ggplotcode:

my.labels <- c("1250 mg/m²\nwith dose\nadjustments",
           "1250 mg/m²\nwithout dose\nadjustments")
ggplot(data, aes(x=TIME, y=P2X))+ 
      stat_summary(geom="ribbon", fun.data=mean_cl_boot, 
    fun.args=list(conf.int=0.95), 
                  aes(fill=factor(GROUP),color=NA), alpha = .55)+
      stat_summary(geom="line", fun.y=mean, aes(linetype=factor(GROUP)))+
      scale_fill_manual("dose",labels=my.labels,values=c("#999999", "lightblue"))+
      scale_linetype_manual("dose",labels=my.labels,values = c("solid","dashed"))+
      scale_color_manual("dose",values=c("black", "black"))+labs(x="Time [cycle]",y="Probability [%]")+
      scale_x_continuous(breaks = c(0,1,2,3,4,5,6))+ theme(legend.title=element_blank())+
      scale_y_continuous(labels = function(x) paste0(x*100))

Thank you in advance!


Solution

  • I use the following function to easily add non-parametric estimates and confidence intervals for the median with stat_summary:

    
    quantile_cl <- function(y, q=0.5, conf.level = 0.95, na.rm=TRUE) {
      alpha <- 1 - conf.level
      if (na.rm) y <- y[!is.na(y)]
      n <- length(y)
      l <- qbinom(alpha/2, size=n, prob = q)
      u <- 1 + n - l
      ys <- sort.int(c(-Inf, y, Inf), partial = c(1 + l, 1 + u))
      data.frame(
        y = quantile(y, probs = q, na.rm=na.rm, type = 8),
        ymin = ys[1 + l],
        ymax = ys[1 + u]
      )
    }
    
    median_cl <- function(y, conf.level=0.95, na.rm=TRUE) quantile_cl(y, q=0.5, conf.level=conf.level, na.rm=na.rm)
    

    Example use:

    library(ggplot2)
    
    ggplot(iris, aes(x = Species, y = Sepal.Length)) +
      geom_violin(aes(fill = Species), trim = FALSE) +
      stat_summary(fun.data = mean_cl_normal, geom="crossbar", aes(color = "normal")) +
      stat_summary(fun.data = median_cl, aes(color = "nonparametric"), size = 1) +
      scale_color_manual(values = c("red", "black"), "method") +
      NULL
    

    These non-parametric confidence intervals for the median are always almost exact no matter the true distribution, but if you know the distribution you could construct better estimates & intervals. For example, if you knew your data came from a Gaussian mean_cl_normal would give the best estimates for the median (not of your sample but of the underlying distribution)!

    # coverage probability of ci
    mean(replicate(1000, {res <- quantile_cl(rnorm(100), q=0.5, conf.level=0.8); res$ymin<=0 & res$ymax>=0}))
    #> [1] 0.814
    mean(replicate(1000, {res <- mean_cl_normal(rnorm(100), conf.int = 0.8); res$ymin<=0 & res$ymax>=0}))
    #> [1] 0.804
    
    # mean width of ci
    mean(replicate(1000, {res <- quantile_cl(rnorm(100), q=0.5, conf.level=0.8); res$ymax-res$ymin}))
    #> [1] 0.3276851
    mean(replicate(1000, {res <- mean_cl_normal(rnorm(100), conf.int = 0.8); res$ymax-res$ymin}))
    #> [1] 0.2574329
    

    If there is demand, I'll try to get this function into Hmisc (where mean_cl_boot (smean.cl.normal) etc. are implemented).