Search code examples
rggplot2regressionvisualizationconfidence-interval

Discrepancy in Confidence Intervals using plot() vs. ggplot2


I'm trying to plot the confidence intervals (CIs) for a regression line using both plot() and ggplot2 in R. However, I've noticed that the CI appears notably different between the two methods. I'd appreciate help in understanding why this discrepancy exists.

Below is the R code I've been working with:

# Loading libraries
library(ggplot2)
library(MethComp)

# Hypothethical data set for stackoverflow
set.seed(123) # Set seed for reproducibility
data_for_plot <- data.frame(rater1 = rnorm(100, mean = 50, sd = 10))
data_for_plot$rater2 <- data_for_plot$rater1 + runif(100, min = -20, max = 20)

# Prepare for x and y axis coordinates
common_min <- min(c(data_for_plot$rater1, data_for_plot$rater2))
common_min <- floor(common_min / 5) * 5
common_max <- max(c(data_for_plot$rater1, data_for_plot$rater2))
common_max <- ceiling(common_max / 5) * 5
common_breaks <- seq(from = common_min, to = common_max, length.out = 5)

# Passing-Bablok regression
pb_result <- PBreg(data_for_plot$rater1, data_for_plot$rater2)
pb_intercept <- pb_result$coefficients["Intercept", "Estimate"]
pb_intercept_lci <- pb_result$coefficients["Intercept", "2.5%CI"]
pb_intercept_uci <- pb_result$coefficients["Intercept", "97.5%CI"]
pb_slope <- pb_result$coefficients["Slope", "Estimate"]
pb_slope_lci <- pb_result$coefficients["Slope", "2.5%CI"]
pb_slope_uci <- pb_result$coefficients["Slope", "97.5%CI"]

# Manually creating CI ribbon (Problematic segment)
x_vals <- seq(from = common_min, to = common_max, length.out = 100)
upper_bound <- pb_intercept_uci + pb_slope_uci * x_vals
upper_bound <- pmin(upper_bound, common_max)
lower_bound <- pb_intercept_lci + pb_slope_lci * x_vals
lower_bound <- pmax(lower_bound, common_min)
ribbon_data <- data.frame(
  x = x_vals,
  ymin = lower_bound,
  ymax = upper_bound
)

# Generating plots
gg_plot_result <- ggplot(
  data = data_for_plot, aes(x = rater1, y = rater2)) +
  geom_ribbon(data = ribbon_data, aes(x = x, ymin = ymin, ymax = ymax), 
              fill = "#253494", alpha = 0.2, inherit.aes = FALSE)+
  geom_abline(intercept = 0, slope = 1, 
              color = "orange", size = 0.25, linetype="dashed") +
  geom_abline(intercept = pb_intercept, slope = pb_slope, 
              color = "#253494", linewidth = 1) +  # Plotting the regression line
  geom_point(shape = 21, colour = "black", fill = "white", size = 2.5) +
  labs(title = "Measurement (cm)", x = "Rater 1", y = "Rater 2") +           
  theme_minimal() +                                                      
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA, linewidth = 0.5),
    plot.title = element_text(hjust = 0.5),
    aspect.ratio = 1
  ) +
  coord_fixed(ratio = 1) +
  scale_x_continuous(limits = c(common_min, common_max), breaks = common_breaks) +
  scale_y_continuous(limits = c(common_min, common_max), breaks = common_breaks)

# Generating PB plot directly
pb_plot_result <- plot(
  pb_result, asp = 1,
  xlim = c(common_min, common_max), 
  ylim = c(common_min, common_max),
  main = "Measurement (cm)",
  xlab = "Rater 1",
  ylab = "Rater 2"
)

print(gg_plot_result)

When using plot(), I observed a figure (let's call it figure 1) that displayed a relatively smaller CI.

Figure 1

On the other hand, with ggplot2, by manually calculating and adding the CI ribbon, the resulting CI appeared much larger. (Figure 2)

Figure 2

Can anyone shed light on what might be causing this difference? Specifically, what other methods can be used to plot confidence intervals with calculated CI of both the intercept and slope?

Any insights would be highly appreciated. Thank you.


Solution

  • You're computing confidence intervals on the predictions incorrectly. You have

    upper_bound <- pb_intercept_uci + pb_slope_uci * x_vals
    

    and a similar expression for the lower bound. For ordinary linear regression, the confidence intervals would be something like

    sqrt(var_int + 2*cov_int_slope*x + var_slope*x^2)
    

    but looking at the code of MethComp:::predict.PBreg shows that something very different (although presumably correct for this kind of analysis) is being computed. It's possible that this computation is described in the original Passing and Bablok (1983) reference given in ?PBreg: I haven't checked.