Tags: r, time-series, regression, splines, smoothing

`smooth.spline` severely underfits long (periodic) time series


I would like to smooth very long, noisy time series in R, but I have found that for highly periodic data, the out-of-the-box smooth.spline() quickly breaks down and the smoothed output begins to exhibit ringing.

Consider a cosine time series (with or without noise):

t <- seq(0, 100 * 2 * pi, length.out = 3000)
y <- cos(t)  # + rnorm(length(t), 0, 0.05)

y100_s <- smooth.spline(y)$y

plot(y ~ t, type = "l")
lines(y100_s ~ t, col = "blue")

We can examine the effect of feeding more points into smooth.spline():

# RMS increases as points are added to smooth.spline
rms <- sapply(seq(250, 3000, by = 250), function(i)
  sqrt(mean((y[1:i] - smooth.spline(y[1:i])$y)^2)))

plot(rms)

Even at lower frequencies the fit rings (this check is optional):

t <- seq(0, 50 * 2 * pi, length.out = 3000)
y <- cos(t)  # + rnorm(length(t), 0, 0.05)

y50_s <- smooth.spline(y)$y

library(pracma)  # for findpeaks()

peaks <- list(findpeaks(y50_s), findpeaks(-y50_s))

plot(y ~ t, type = "l")
lines(y50_s ~ t, col = "red")

# trace the upper and lower envelopes of the smoothed fit
lines(peaks[[1]][, 1] ~ t[peaks[[1]][, 2]], type = "l")
lines(-peaks[[2]][, 1] ~ t[peaks[[2]][, 2]], type = "l")

After exploring for a bit, this behaviour appears to be governed by the spar argument, but I can't set it small enough to eliminate the effect. This might be an obvious consequence of spline fitting, and my fault for relying on out-of-the-box methods, but I would appreciate some insight. Are there controls I can specify in smooth.spline(), or alternative recommendations/strategies for smoothing?
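For concreteness, here is a minimal illustration of the spar experiment (the value 1e-4 is arbitrary); even a very small smoothing parameter does not eliminate the ringing:

y_low_spar <- smooth.spline(y, spar = 1e-4)$y

plot(y ~ t, type = "l")
lines(y_low_spar ~ t, col = "blue")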


Solution

  • I don't know whether you are always fitting a periodic signal. If so, a periodic spline from mgcv::gam is a much better choice (see the sketch at the end of this answer). But let's set that issue aside for the moment.

    If your data oscillate rapidly and often, you have to choose a sufficient number of knots, i.e., a decent density of knots; otherwise you simply end up with over-smoothing (i.e., under-fitting).

    Have a look at your example:

    t <- seq(0, 100 * 2 * pi, length.out = 3000)
    y <- cos(t) # + rnorm(length(t), 0, 0.05)
    fit <- smooth.spline(t, y)
    

    You have n = 3000 data points. By default, smooth.spline uses far fewer knots than data points when n > 49. Precisely, the number is chosen by the service routine .nknots.smspl, but there is no optimality justification for that choice, so it is up to you to judge whether it is reasonable. Let's check:

    fit$fit$nk - 2L  ## or `.nknots.smspl(3000)`
    # [1] 194
    
    fit$df
    # [1] 194
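
    As an aside, you can call the routine directly to see how slowly the default knot count grows with the data size (a quick illustration):

    sapply(c(100, 1000, 3000, 10000), .nknots.smspl)  # returns 194 for n = 3000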
    

    The fit uses only 194 knots, and the model ends up with 194 degrees of freedom, i.e., there is no penalization effect at all. As I said earlier, this is under-fitting:

    plot(t, y, type = "l", col = "gray")
    lines(fit, col = 2)
    

    [Plot: the under-fitted spline (red) on top of the data (gray)]

    Ideally, penalized regression should end up with degrees of freedom substantially smaller than the number of knots. It is often forgotten that penalization is there to fix the over-fitting of the original non-penalized regression. If we don't even see a penalization effect, then the original non-penalized model is under-fitting the data, so increase the number of knots until we reach an over-fitting regime. If you don't want to think about this, set all.knots = TRUE. A univariate smoothing spline is computationally cheap, at O(n) cost, so even if you use all data points as knots you won't run into efficiency problems.

    fit <- smooth.spline(t, y, all.knots = TRUE)
    
    fit$fit$nk - 2L
    # [1] 3000
    
    fit$df
    # [1] 3000
    

    Hmm, we still do not see the effect of penalization. Why? Because the data are not noisy: you did not add noise to your y, so with all knots we are simply interpolating. Add some noise to y to truly see what penalization does.

    set.seed(0)
    t <- seq(0, 100 * 2 * pi, length.out = 3000)
    y <- cos(t) + rnorm(length(t), 0, 0.05)
    
    fit <- smooth.spline(t, y, all.knots = TRUE)
    
    fit$fit$nk
    # [1] 3002
    
    fit$df
    # [1] 705.0414
    

    Note how much smaller 705 is than 3000. Now have a look at the fitted spline:

    plot(t, y, type = "l", col = "gray")
    lines(fit, col = 2)
    

    There is neither under-fitting nor over-fitting; penalization strikes the right trade-off between bias and variance.
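    If you want a quick numerical check (illustrative, not strictly necessary), compare the fitted values against the noise-free truth; the RMS error should come out noticeably smaller than the noise standard deviation of 0.05:

    sqrt(mean((fit$y - cos(t))^2))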

    [Plot: the penalized fit (red) tracking the noisy data (gray) without ringing]
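
    Finally, back to the periodic-spline remark at the top. Here is a minimal sketch, assuming the signal really is periodic and the period (2 * pi here) is known in advance; the phase-folding trick and the choice k = 20 are illustrative, not prescriptive. Fold time onto a single cycle and fit a cyclic cubic spline (bs = "cc"), which constrains the fit to join smoothly at the two ends of the cycle:

    library(mgcv)

    set.seed(0)
    t <- seq(0, 100 * 2 * pi, length.out = 3000)
    y <- cos(t) + rnorm(length(t), 0, 0.05)
    phase <- t %% (2 * pi)  # position of each observation within one cycle

    fit_gam <- gam(y ~ s(phase, bs = "cc", k = 20))

    plot(t, y, type = "l", col = "gray")
    lines(t, fitted(fit_gam), col = "blue")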