I would like to smooth very long, noisy data in R, but I have found that for highly periodic data, the out-of-the-box smooth.spline() quickly breaks down and the smoothed result begins to exhibit ringing.
Consider a cosine time series (with or without noise):
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t)  # + rnorm(length(t), 0, 0.05)
y100_s <- smooth.spline(y)$y
plot(y ~ t, type = "l")
lines(y100_s ~ t, col = "blue")
We can examine the effect of feeding more points to smooth.spline():
# RMS error increases as more points are passed to smooth.spline()
rms <- sapply(seq(250, 3000, by = 250), function(i)
  sqrt(mean((y[1:i] - smooth.spline(y[1:i])$y)^2)))
plot(rms)
Even at lower frequencies the fit still rings (optional demonstration, tracing the envelopes of the smoothed peaks):
t <- seq(0, 50 * 2 * pi, length.out = 3000)
y <- cos(t)  # + rnorm(length(t), 0, 0.05)
y50_s <- smooth.spline(y)$y

library(pracma)  # provides findpeaks()
peaks <- list(findpeaks(y50_s), findpeaks(-y50_s))
plot(y ~ t, type = "l")
lines(y50_s ~ t, col = "red")
# trace the upper and lower envelopes of the smoothed fit
lines(peaks[[1]][, 1] ~ t[peaks[[1]][, 2]], type = "l")
lines(-peaks[[2]][, 1] ~ t[peaks[[2]][, 2]], type = "l")
After exploring for a bit, this behaviour appears to be a function of the spar argument, but I can't set it to a small enough value to eliminate the effect. This might be an obvious consequence of spline fitting, and a fault of relying on out-of-the-box methods, but I would appreciate some insight. Are there controls I can specify in smooth.spline(), or alternative recommendations/strategies for smoothing?
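For reference, a sketch of the kind of spar sweep I tried (the specific values here are illustrative):

# per the exploration above, even small spar values do not eliminate the ringing
rms_spar <- sapply(c(0.5, 0.25, 0.1, 0.01), function(s)
  sqrt(mean((y - smooth.spline(y, spar = s)$y)^2)))
rms_spar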
I don't know whether you are always fitting a periodic signal. If that is the case, a periodic spline from mgcv::gam is much better. However, let's set this issue aside for the moment.
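For completeness, here is a minimal sketch of that periodic route before moving on. It assumes the period (2 * pi here) is known so that time can be folded onto a single cycle; bs = "cc" requests mgcv's cyclic cubic spline:

library(mgcv)
set.seed(0)
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) + rnorm(length(t), 0, 0.05)
phase <- t %% (2 * pi)  # fold time onto a single known period
fit_cc <- gam(y ~ s(phase, bs = "cc"), knots = list(phase = c(0, 2 * pi)))
plot(t, y, type = "l", col = "gray")
lines(t, fitted(fit_cc), col = 2)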
If your data oscillate rapidly, you have to choose a sufficient number of knots, i.e., a decent knot density; otherwise you simply end up over-smoothing (i.e., under-fitting).
Have a look at your example:
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) # + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y)
You have n = 3000 data points. By default, smooth.spline uses far fewer knots than data points when n > 49; precisely, the number is selected by the service routine .nknots.smspl. But there is no optimality justification for this default, so it is up to you to judge whether it is reasonable. Let's check:
length(fit$fit$nk) - 2L  # or .nknots.smspl(3000)
# [1] 194
fit$df
# [1] 194
It uses only 194 knots, and the model ends up with 194 degrees of freedom, i.e., with no penalization effect at all. As I said earlier, you just end up under-fitting:
plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
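If you would rather not jump straight to using all knots, smooth.spline also exposes an nknots argument, so you can raise the knot density directly. A sketch (the value 1000, roughly 10 knots per cycle, is my illustrative choice):

fit2 <- smooth.spline(t, y, nknots = 1000)  # much denser than the default 194 knots
plot(t, y, type = "l", col = "gray")
lines(fit2, col = 4)  # ringing disappears once knot density keeps up with the oscillation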
Ideally, penalized regression ends up with degrees of freedom substantially smaller than the number of knots. It is often forgotten that penalization is there to fix the over-fitting that results from the original non-penalized regression. If we don't even see a penalization effect, then the original non-penalized model is under-fitting the data, so increase the number of knots until we reach an over-fitting status. If you don't want to think about this, just set all.knots = TRUE. A univariate smoothing spline is computationally cheap, at O(n) cost, so even if you use all data points as knots, you won't run into efficiency problems.
fit <- smooth.spline(t, y, all.knots = TRUE)
length(fit$fit$nk) - 2L
# [1] 3000
fit$df
# [1] 3000
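A quick check (my addition) confirms that the fit reproduces the data essentially exactly:

max(abs(fit$y - y))  # practically zero: with df = n, the spline interpolates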
Oh, we still don't see the effect of penalization. Why? Because we don't have noisy data: you did not add noise to your y, so by using all knots we are doing interpolation. Add some noise to y to truly see what I explained about penalization.
set.seed(0)
t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t) + rnorm(length(t), 0, 0.05)
fit <- smooth.spline(t, y, all.knots = TRUE)
length(fit$fit$nk)
# [1] 3002
fit$df
# [1] 705.0414
Note how much smaller 705 is compared with 3000. Have a look at the fitted spline:
plot(t, y, type = "l", col = "gray")
lines(fit, col = 2)
There is neither under-fitting nor over-fitting; penalization achieves an optimal trade-off between bias and variance.
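To see that trade-off explicitly, you can override the automatic penalty selection through the df argument of smooth.spline (the values below are illustrative, not part of the analysis above):

fit_over  <- smooth.spline(t, y, all.knots = TRUE, df = 50)    # penalty too strong: over-smoothed
fit_under <- smooth.spline(t, y, all.knots = TRUE, df = 2500)  # penalty too weak: chases the noise
plot(t, y, type = "l", col = "gray")
lines(fit_over, col = 3)
lines(fit_under, col = 4)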