Let’s construct a linear trend with some “noise” as follows:
X <- seq(0,20,1); Y <- X/4 + sin(X); plot(X,Y)
Now I smooth the dataset with loess
and plot “95% confidence intervals” with qt
and the loess standard error:
X.pred <- seq(0, 20, length.out = 1000) # To have smooth lines
Fit <- predict(loess(Y ~ X, span = 0.75), newdata = X.pred, se = TRUE)
lines(X.pred, Fit$fit)
lines(X.pred, Fit$fit + qt(0.975, Fit$df) * Fit$se.fit, lty = 3)
lines(X.pred, Fit$fit - qt(0.975, Fit$df) * Fit$se.fit, lty = 3)
The result is very intuitive, as the subjacent trend is clearly shown and the “band” covers almost all points (as expected in a 95% confidence interval).
The problem arises when we have many points to be fitted. Let’s increase the number of points by a factor of 10:
X <- seq(0, 20, 0.1); Y <- X/4 + sin(X); plot(X, Y)
When I run the same script as above, the band is now very narrow and clearly does not cover 95% of the points. How can I obtain “intuitive” loess
95% bands, irrespective of the number of fitted points?
This is a misunderstanding of the behaviour of confidence bands; for any reasonable form of estimation, the confidence intervals will shrink to zero as the sample size increases (even if the residual variance stays the same). You seem to be looking for prediction intervals. You can get these, approximately, by adding the residual variance (e.g. resid.sd <- loess(Y ~ X, span = 0.75)$s; resid.var <- resid.sd^2
) to the squared standard error of the fit, and using the square root of the sum as the standard deviation in qnorm()
. (It's not easy to combine a t-distributed distribution of parameter error and a Gaussian distribution of residual error ...)