Search code examples
rggplot2continuous-integrationloess

LOESS confidence intervals excessively narrow in qqplot2


I'm having some trouble understanding how confidence intervals are calculated in ggplot2 while using LOESS smoothing. From a few other threads, my understanding is that ggplot2 uses t-intervals calculated based on regression standard errors, i.e., using the distance between the actual data points and the LOESS line. But I think I must be mistaken based on the confidence intervals that ggplot2 produces. Here's example code (actually qplot in this case, but I think the result should be the same):

qplot(Year, Purposivism, data=fig1.dat, geom=c('point', 'smooth'), level=0.99, span=0.5, method='loess', ylab="Term Frequency per Million Words") +
theme_bw() +
theme(text=element_text(family="Century", size=12)) +
expand_limits(y = 0) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme(axis.text.x = element_text(angle = 45, hjust=1))

Here's the resulting graph:

Graph

On the left side of the graph (say, 1920-1940) the points are tightly packed around the LOESS line, and are mostly inside the confidence intervals. But from around 1960-1980, they're all over the place, yet the width of the confidence interval seems about the same. I think I must be misunderstanding how the CIs work, because this seems unintuitive.

Thanks in advance for your help! Very happy to provide any other information that might be useful.


Solution

  • Where you are likely confused is in the difference between confidence and prediction interval. Confidence intervals, which are used in geom_smooth are the predicted confidence in the estimated mean. This is a measure of how far the average of your observations will deviate at the point estimate. In predict.lm there is an option to add interval = "prediction", which would give you the prediction interval. The prediction interval incorporates the uncertainty in the error-term from y ~ x %*% beta + epsilon, while the confidence interval only incorporates the fixed effect uncertainty from y ~ x %*% beta. I have not looked into prediction intervals for loess curves, and other non- and semi-parametric smoothers, but it does not seem to be implemented in ?predict.loess

    We can illustrate how geom_smooth estimates the confidence intervals by manual calculation. Lets start by using the most boring reproducible example. mtcars from the stats package (included in base R).

    data(mtcars)
    fit <- loess(mpg ~ hp, data = mtcars)
    preds <- predict(fit, se = TRUE)
    names(preds)
    #[1] "fit"            "se.fit"         "residual.scale" "df" 
    

    To calculate the confidence interval, we use the standard formula as you correctly specified.

    T <- qt(p = 0.975, df = preds$df)
    lwr <- preds$fit - T * preds$se.fit
    upr <- preds$fit + T * preds$se.fit
    

    To create a proper plot of the confidence interval i merge all the necessary info into a single data.frame, while ordering the input, to ensure proper line order.

    ord <- order(mtcars$hp)
    plotData <- data.frame(lwr = lwr[ord], 
                           upr = upr[ord], 
                           fit = preds$fit[ord], 
                           hp = mtcars$hp[ord], 
                           mpg = mtcars$mpg[ord])
    

    Last but not least we simply need to create the plot, and compare it to the one produced by ggplot2

    p1 <- ggplot(plotData, aes(x = hp, ymax = upr, ymin = lwr)) + 
        #Data points
        geom_point(aes(y = mpg)) + 
        #Line from prediction
        geom_line(aes(y = fit)) + 
        #Points from prediction
        geom_point(aes(y = fit)) + 
        #Confidence interval
        geom_ribbon(alpha = 0.3, col = "thistle1") + 
        labs(title = "manual")
    p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + 
        geom_point() + 
        geom_smooth() + 
        labs(title = "ggplot2")
    #Merge plots
    library(gridExtra)
    grid.arrange(p1, p2, ncol = 1)
    

    Now for the output: Image of loess smoother produced by the described code

    Except for some smoothing done by ggplot, and the added points for the fitted values this is easily seen to be identical.

    I hope this clears out how the points confidence interval is calculated.