LOESS confidence intervals excessively narrow in qqplot2

I'm having some trouble understanding how confidence intervals are calculated in ggplot2 while using LOESS smoothing. From a few other threads, my understanding is that ggplot2 uses t-intervals calculated based on regression standard errors, i.e., using the distance between the actual data points and the LOESS line. But I think I must be mistaken based on the confidence intervals that ggplot2 produces. Here's example code (actually qplot in this case, but I think the result should be the same):

qplot(Year, Purposivism, data=fig1.dat, geom=c('point', 'smooth'), level=0.99, span=0.5, method='loess', ylab="Term Frequency per Million Words") +
theme_bw() +
theme(text=element_text(family="Century", size=12)) +
expand_limits(y = 0) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme(axis.text.x = element_text(angle = 45, hjust=1))

Here's the resulting graph:

Graph

On the left side of the graph (say, 1920-1940) the points are tightly packed around the LOESS line, and are mostly inside the confidence intervals. But from around 1960-1980, they're all over the place, yet the width of the confidence interval seems about the same. I think I must be misunderstanding how the CIs work, because this seems unintuitive.

Thanks in advance for your help! Very happy to provide any other information that might be useful.

Solution

Where you are likely confused is in the difference between confidence and prediction interval. Confidence intervals, which are used in geom_smooth are the predicted confidence in the estimated mean. This is a measure of how far the average of your observations will deviate at the point estimate. In predict.lm there is an option to add interval = "prediction", which would give you the prediction interval. The prediction interval incorporates the uncertainty in the error-term from y ~ x %*% beta + epsilon, while the confidence interval only incorporates the fixed effect uncertainty from y ~ x %*% beta. I have not looked into prediction intervals for loess curves, and other non- and semi-parametric smoothers, but it does not seem to be implemented in ?predict.loess

We can illustrate how geom_smooth estimates the confidence intervals by manual calculation. Lets start by using the most boring reproducible example. mtcars from the stats package (included in base R).

data(mtcars)
fit <- loess(mpg ~ hp, data = mtcars)
preds <- predict(fit, se = TRUE)
names(preds)
#[1] "fit"            "se.fit"         "residual.scale" "df"

To calculate the confidence interval, we use the standard formula as you correctly specified.

T <- qt(p = 0.975, df = preds$df)
lwr <- preds$fit - T * preds$se.fit
upr <- preds$fit + T * preds$se.fit

To create a proper plot of the confidence interval i merge all the necessary info into a single data.frame, while ordering the input, to ensure proper line order.

ord <- order(mtcars$hp)
plotData <- data.frame(lwr = lwr[ord], 
                       upr = upr[ord], 
                       fit = preds$fit[ord], 
                       hp = mtcars$hp[ord], 
                       mpg = mtcars$mpg[ord])

Last but not least we simply need to create the plot, and compare it to the one produced by ggplot2

p1 <- ggplot(plotData, aes(x = hp, ymax = upr, ymin = lwr)) + 
    #Data points
    geom_point(aes(y = mpg)) + 
    #Line from prediction
    geom_line(aes(y = fit)) + 
    #Points from prediction
    geom_point(aes(y = fit)) + 
    #Confidence interval
    geom_ribbon(alpha = 0.3, col = "thistle1") + 
    labs(title = "manual")
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + 
    geom_point() + 
    geom_smooth() + 
    labs(title = "ggplot2")
#Merge plots
library(gridExtra)
grid.arrange(p1, p2, ncol = 1)

Now for the output:

Except for some smoothing done by ggplot, and the added points for the fitted values this is easily seen to be identical.

I hope this clears out how the points confidence interval is calculated.