Search code examples
rggplot2sampleconfidence-interval

How to plot a 95% confidence interval graph for one sample proportion


enter image description here

qplot(x="SignUp", y=0.07, ymin=Lower_Level, ymax=Upper_Level, ylim=c(0,1), geom = "pointrange")+coord_flip() +
  
  ylab("SignUp Proportion")+geom_hline(yintercept=Upper_Level)+geom_hline(yintercept=Lower_Level)

This is what I managed to plot. But I want something like the graph below. The confidence intervals are 0.084 and 0.0551. The sample proportion is 0.07

enter image description here


Solution

  • I guess you could show the 95% confidence interval for the estimated probability like this:

    First, start with a data frame of 1s and 0s representing your "success" and "failure" rate in the sample. Here, your numbers suggest approximately 105 out of 1500 successes, so we do:

    df <- data.frame(x = c(rep(1, 105), rep(0, 1395)))
    

    Now we fit a logistic regression with the intercept being the only parameter we are estimating:

    mod <- coef(summary(glm(x ~ 1, family = binomial, data = df)))
    
    mod
    #>              Estimate Std. Error  z value      Pr(>|z|)
    #> (Intercept) -2.586689  0.1011959 -25.5612 4.122466e-144
    

    The estimate here should be normally distributed (on the log odds scale) with the given estimate and standard error, so we can grab the density values over an appropriate range by doing:

    xvals <- seq(mod[1] - 3 * mod[2], mod[1] + 3 * mod[2], 0.01)
    yvals <- dnorm(xvals, mod[1], mod[2])
    

    Now we convert the x values from log odds to probabilities:

    pxvals <- exp(xvals)/(1 + exp(xvals))
    

    We will also create a vector that labels whether the values are within 1.96 standard deviations of the estimate:

    level <- ifelse(xvals < mod[1] - 1.96 * mod[2], "lower",
              ifelse(xvals > mod[1] + 1.96 * mod[2], "upper", "estimate"))
    

    Now we put all of these in a data frame and plot:

    plot_df <- data.frame(xvals, yvals, pxvals, level)
    
    library(ggplot2)
    
    ggplot(plot_df, aes(pxvals, yvals, fill = level)) +
      geom_area(alpha = 0.5) +
      geom_vline(xintercept = exp(mod[1])/(1 + exp(mod[1])), linetype = 2) + 
      scale_fill_manual(values = c("gray70", "deepskyblue4", "deepskyblue4"),
                        guide = guide_none()) +
      scale_x_continuous(limits = c(0.03, 0.13), breaks = 3:12/100,
                         name = "probability") +
      theme_bw()
    

    enter image description here

    Note that because we have transformed the x axis, this is no longer a genuine density plot. The y axis becomes somewhat arbitrary as a result, but the plot still shows accurately the 95% confidence interval for the probability estimate.


    EDIT

    Here's an alternative method if the glm approach seems too complicated. It uses the binomial distribution to get the 95% confidence intervals. You just supply it with the population size and the number of "successes"

    library(ggplot2)
    
    population <- 1500
    actual_successes <- 105
    test_successes <- 1:300
    
    density <- dbinom(test_successes, population, actual_successes/population)
    probs   <- pbinom(test_successes, population, actual_successes/population)
    label   <- ifelse(probs < 0.025, "low", ifelse(probs > 0.975, "high", "CI"))
    
    ggplot(data.frame(probability = test_successes/population, density, label),
           aes(probability, density, fill = label)) +
      geom_area(alpha = 0.5) +
      geom_vline(xintercept = actual_successes/population, linetype = 2) + 
      scale_fill_manual(values = c("gray70", "deepskyblue4", "deepskyblue4"),
                        guide = guide_none()) +
      scale_x_continuous(limits = c(0.03, 0.13), breaks = 3:12/100,
                         name = "probability") +
      theme_bw()
    

    enter image description here