Search code examples
rggplot2bootlinegraph

Line graphs with 95% bootstrap confidence interval


Hi I matched a sample of individuals from the treatment group and the control group by using the genetic matching approach from the MatchIt package. The outcome variables is the average spending of each month. Now I'm trying to generate a line graph to demonstrate patterns of average monthly spending of the treatment and control groups. I would like to add shaded areas to show the bootstrap confidence interval for these lines. The main issue that I encountered include:

1. Bootstrap in pairs

Bootstrap resampling for matched sample should be conducted based on each matched pair, not each individual. Thankfully I've got the solution from this post. The code will be provided in the example.

2. Plotting bootstrap confidence interval

I'm not sure how I can plot the bootstrap confidence intervals on the graphs.

Here I provide an example:

library(dplyr)
library(tidyr)
library(ggplot2)

ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L")
Pair <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
Treatment <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
Month_1 <- c(300, 150, 450, 100, 200, 300, 400, 600, 650, 150, 200, 400)
Month_2 <- c(400, 600, 650, 150, 200, 400, 500, 250, 700, 200, 300, 500)
Month_3 <- c(500, 250, 700, 200, 300, 500, 500, 250, 700, 200, 300, 500)
Month_4 <- c(600, 700, 650, 250, 500, 550, 600, 700, 650, 250, 500, 550)
Month_5 <- c(700, 200, 800, 300, 900, 800, 600, 700, 650, 250, 500, 550)

    df <- data.frame(ID, Pair, Treatment, Month_1, Month_2, Month_3, Month_4, Month_5)

> df

   ID Pair Treatment Month_1 Month_2 Month_3 Month_4 Month_5
1   A    1         1     300     400     500     600     700
2   B    1         0     150     600     250     700     200
3   C    2         1     450     650     700     650     800
4   D    2         0     100     150     200     250     300
5   E    3         1     200     200     300     500     900
6   F    3         0     300     400     500     550     800
7   G    4         1     400     500     500     600     600
8   H    4         0     600     250     250     700     700
9   I    5         1     650     700     700     650     650
10  J    5         0     150     200     200     250     250
11  K    6         1     200     300     300     500     500
12  L    6         0     400     500     500     550     550

In this dataset, "Month_1-5" denote the monthly spending of each individual; "Treatment" denotes if the individual is in the treatment group (coded as 1) or control group (coded as 0); and "Pair" indicates each matched pair. For example, individual A and B is in the sample pair, because they share the same Pair number of "1". Therefore, after bootstrap resampling, if A appears in any sample, B should appear as well.

Here I provide a code to conduct bootstrap resampling and to calculate mean spending for each month across treatment and control groups:

### 95% paired bootstrap confidence interval
library(boot)
df <- df %>%
  mutate(Pair = as.factor(Pair))
pair_ids <- levels(df$Pair)

# For mean of Month_1 in control group
est_control <- function(pairs, i) {
  
  # Compute number of times each pair is present
  numreps <- table(pairs[i])
  
  # For each pair p, copy corresponding data indices numreps[p] times
  ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
                       function(p) rep(which(df$Pair == p),
                                       numreps[p])))

  # Subset df with paired bootstrapped ids
  md_boot <- df[ids,]
                
  # Estimation
  boot_control <- mean(md_boot[md_boot$Treatment == 0, "Month_1"])

  # Return the mean
  return(boot_control)
}
set.seed(1234)
boot_est <- boot(pair_ids, est_fun, R = 10000)
boot.ci(boot.out = boot_est, type = "bca")

# For mean of Month_1 in treatment group
est_treat <- function(pairs, i) {
  
  # Compute number of times each pair is present
  numreps <- table(pairs[i])
  
  # For each pair p, copy corresponding data indices numreps[p] times
  ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
                       function(p) rep(which(df$Pair == p),
                                       numreps[p])))
  
  # Subset df with paired bootstrapped ids
  md_boot <- df[ids,]
  
  # Estimation
  boot_treat <- mean(md_boot[md_boot$Treatment == 1, "Month_1"])
  
  # Return the mean
  return(boot_treat)
}
set.seed(1234)
boot_est <- boot(pair_ids, est_treat, R = 10000)
boot.ci(boot.out = boot_est, type = "bca")

In this case, the bootstrap confidence interval of "Month_1" for control group is [158.3, 441.7], and that for treatment group is [250, 500].

Then I calculate the average monthly spending for the control and treatment groups and then generate the line graph:

Monthly_spending <- df %>%
  group_by(Treatment) %>%
  summarise("1" = mean(Month_1, na.rm = T),
            "2" = mean(Month_2, na.rm = T),
            "3" = mean(Month_3, na.rm = T),
            "4" = mean(Month_4, na.rm = T),
            "5" = mean(Month_1, na.rm = T)) %>%
  pivot_longer(c("1":"5"), names_to = "Month", values_to = "Spending")

Monthly_spending %>%
  mutate(Treatment = as.factor(Treatment)) %>%
  ggplot(aes(x = Month, y = Spending, group = Treatment, color = Treatment)) + 
  geom_point() + 
  geom_line(aes(group = Treatment)) +
  ggtitle("Monthly Spending") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_discrete("Groups", labels = c("Control", "Treatment")) +
  theme(legend.position = "bottom") 

enter image description here

However, I don't know how to add shaded areas around these two lines to indicate the bootstrap confidence intervals from the code that I provided above. Alternatively, I could run the code for each time point again and again, and then document that in the dataset manually to plot the intervals on the graphs. However, I wonder if there's a way to conduct it more efficiently. I would be really grateful for your help.


Solution

  • This seems like a very complex way to achieve your end goal. The ggplot2 function mean-cl_boot takes an input vector and returns y, ymin and ymax values calculated by a fast bootstrap method. It works directly inside summarize too, so we can replace your whole code with:

    df %>%
      pivot_longer(starts_with("Month"), names_to = "Month") %>%
      group_by(Month, Treatment) %>%
      summarize(mean_cl_boot(value)) %>%
      mutate(Month = as.numeric(sub("Month_", "", Month)),
             Treatment = c("Control", "Treatment")[Treatment + 1]) %>%
      ggplot(aes(Month, y, color = Treatment)) +
      geom_ribbon(aes(fill = Treatment, ymin = ymin, ymax = ymax), 
                  alpha = 0.1, color = NA) +
      geom_line(size = 1) +
      geom_point(size = 3) +
      scale_fill_brewer(palette = "Set1") +
      scale_color_brewer(palette = "Set1") +
      labs(y = "Monthly spending") +
      theme_minimal(base_size = 16) +
      theme(legend.position = "bottom")
    

    enter image description here