I'm having some trouble understanding how confidence intervals are calculated in ggplot2
while using LOESS smoothing. From a few other threads, my understanding is that ggplot2
uses t-intervals calculated based on regression standard errors, i.e., using the distance between the actual data points and the LOESS line. But I think I must be mistaken based on the confidence intervals that ggplot2
produces. Here's example code (actually qplot
in this case, but I think the result should be the same):
qplot(Year, Purposivism, data=fig1.dat, geom=c('point', 'smooth'), level=0.99, span=0.5, method='loess', ylab="Term Frequency per Million Words") +
theme_bw() +
theme(text=element_text(family="Century", size=12)) +
expand_limits(y = 0) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme(axis.text.x = element_text(angle = 45, hjust=1))
Here's the resulting graph:
On the left side of the graph (say, 1920-1940) the points are tightly packed around the LOESS line, and are mostly inside the confidence intervals. But from around 1960-1980, they're all over the place, yet the width of the confidence interval seems about the same. I think I must be misunderstanding how the CIs work, because this seems unintuitive.
Thanks in advance for your help! Very happy to provide any other information that might be useful.
Where you are likely confused is in the difference between confidence and prediction interval. Confidence intervals, which are used in geom_smooth
are the predicted confidence in the estimated mean. This is a measure of how far the average of your observations will deviate at the point estimate. In predict.lm
there is an option to add interval = "prediction"
, which would give you the prediction interval. The prediction interval incorporates the uncertainty in the error-term from y ~ x %*% beta + epsilon
, while the confidence interval only incorporates the fixed effect uncertainty from y ~ x %*% beta
. I have not looked into prediction intervals for loess
curves, and other non- and semi-parametric smoothers, but it does not seem to be implemented in ?predict.loess
We can illustrate how geom_smooth
estimates the confidence intervals by manual calculation. Lets start by using the most boring reproducible example. mtcars
from the stats
package (included in base R
).
data(mtcars)
fit <- loess(mpg ~ hp, data = mtcars)
preds <- predict(fit, se = TRUE)
names(preds)
#[1] "fit" "se.fit" "residual.scale" "df"
To calculate the confidence interval, we use the standard formula as you correctly specified.
T <- qt(p = 0.975, df = preds$df)
lwr <- preds$fit - T * preds$se.fit
upr <- preds$fit + T * preds$se.fit
To create a proper plot of the confidence interval i merge all the necessary info into a single data.frame
, while ordering the input, to ensure proper line order.
ord <- order(mtcars$hp)
plotData <- data.frame(lwr = lwr[ord],
upr = upr[ord],
fit = preds$fit[ord],
hp = mtcars$hp[ord],
mpg = mtcars$mpg[ord])
Last but not least we simply need to create the plot, and compare it to the one produced by ggplot2
p1 <- ggplot(plotData, aes(x = hp, ymax = upr, ymin = lwr)) +
#Data points
geom_point(aes(y = mpg)) +
#Line from prediction
geom_line(aes(y = fit)) +
#Points from prediction
geom_point(aes(y = fit)) +
#Confidence interval
geom_ribbon(alpha = 0.3, col = "thistle1") +
labs(title = "manual")
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
geom_smooth() +
labs(title = "ggplot2")
#Merge plots
library(gridExtra)
grid.arrange(p1, p2, ncol = 1)
Except for some smoothing done by ggplot
, and the added points for the fitted values this is easily seen to be identical.
I hope this clears out how the points confidence interval is calculated.