Tags: r, regression, gam, mgcv, bspline

When using P-splines to fit data, does the dataset need to be divided into training and test sets?


I am fitting a curve to a 1D dataset with P-splines (using a B-spline basis). What I am struggling with is that in almost every regression model it is considered essential to split the data into training and test sets to check whether the model fits well or overfits. But with P-splines (the mgcv package in R), GCV or AIC is used to control the location and number of knots, to prevent overfitting and to choose the best lambda (or knots). In this situation I am not sure whether I still need to split the dataset. If I do, what is the difference between splitting and not splitting? The paper I read says the penalty helps to prevent overfitting.

And how can I test whether the fit is good? GCV only tells me that I have chosen the best result within this model.

I used the mgcv package in R to do the fitting. The code is as follows:

dataset <- read.csv("bsplinesforR.csv", header = TRUE)

x <- dataset[, 2]
y <- dataset[, 4]

library(gamlss)
gamlss.ps <- gamlss(y ~ pb(x), control = gamlss.control(trace = FALSE))
detach("package:gamlss")

library(mgcv)
mgcv.ps <- gam(y ~ s(x, bs = "ps"), method = "GCV.Cp")

t.gl <- predict(gamlss.ps)  # fitted values from the gamlss P-spline fit
t.g  <- predict(mgcv.ps)    # fitted values from the mgcv P-spline fit

# plot the raw data and overlay both fits (sort by x so lines() draws cleanly)
ord <- order(x)
plot(x[ord], y[ord], type = "l", lty = 2, col = 1)
lines(x[ord], t.gl[ord], col = 2, lwd = 2)
lines(x[ord], t.g[ord], col = 3, lwd = 2)
legend(5, 600, c("Real", "gamlss", "mgcv"), col = c(1, 2, 3), lty = c(2, 1, 1), bty = "n")
legend("topright", "p-splines: default values", col = 1, lty = 0, bty = "n")


The summary of the mgcv fit is:
Family: gaussian 
Link function: identity 

Formula:
y ~ s(x, bs = "ps")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  327.182      4.945   66.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
       edf Ref.df     F p-value    
s(x) 5.294  6.123 214.5  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.961   Deviance explained = 96.4%
GCV = 1518.4  Scale est. = 1344.7    n = 55

Solution

  • GCV is the default method for selecting smoothness parameters in mgcv; it is one of several available. It is the default because it was the best option available when Simon Wood was developing his approach to estimating GAMs via penalized splines. Research since then has shown that GCV can undersmooth: the profile of the GCV score can become very flat around the optimum, so slight variations in data drawn from a known model can yield very different selected smoothness parameters.

    REML and ML smoothness selection, in which the GAM is estimated as a mixed-effects model with the smoothness parameters related to the random-effects variance parameters, have both been shown to work better than GCV, being far less prone to undersmoothing. You should ideally be fitting GAMs in mgcv with method = "REML" or method = "ML".
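    For example, refitting the model from the question with REML smoothness selection only requires changing the method argument (a minimal sketch, assuming x and y are the same vectors as in the question's code):

    library(mgcv)

    ## same P-spline model, but smoothness selected by REML instead of GCV
    mgcv.ps.reml <- gam(y ~ s(x, bs = "ps"), method = "REML")

    ## compare the effective degrees of freedom and fit with the GCV version
    summary(mgcv.ps.reml)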

    In most applications of GAMs fitted with mgcv you wouldn't split-sample your data, or at least you'd have to be very careful about how you do it: the basis has to be constructed from the covariate(s), and you want to avoid extrapolating beyond the range of the data used to build it. There are other issues too; cross-validating a GAM takes a lot of time, and the process isn't invariant to orthogonal transformations of the covariates.
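    If you did want an out-of-sample check anyway, here is a minimal sketch (assuming x and y as above; the hold-out points are drawn from the interior of x so the spline basis is never asked to extrapolate):

    library(mgcv)

    dat <- data.frame(x = x, y = y)
    set.seed(1)
    ## candidate hold-out points exclude the extremes of x
    inner <- which(dat$x > min(dat$x) & dat$x < max(dat$x))
    test  <- sample(inner, size = floor(0.2 * nrow(dat)))

    fit  <- gam(y ~ s(x, bs = "ps"), data = dat[-test, ], method = "REML")
    pred <- predict(fit, newdata = dat[test, ])
    rmse <- sqrt(mean((dat$y[test] - pred)^2))  # out-of-sample error
    rmse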

    There are ways to get smoother models if you are worried about overfitting. One is to apply the one-standard-error rule, in which we choose the smoothest model whose smoothness parameters are within one standard error of the selected optimal values. See ?one.se.rule in mgcv for how to do this.
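    Roughly, following the approach described on that help page (a sketch, assuming a REML fit so that sp.vcov() can return the covariance matrix of the log smoothing parameters):

    library(mgcv)

    b  <- gam(y ~ s(x, bs = "ps"), method = "REML")
    V  <- sp.vcov(b)          ## covariance matrix of the log smoothing parameters
    se <- sqrt(diag(V))[1]    ## SE for the single smoothing parameter of s(x)
    sp.new <- b$sp * exp(se)  ## shift log(sp) up by one SE -> smoother model
    b1 <- gam(y ~ s(x, bs = "ps"), sp = sp.new, method = "REML")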

    Another option is double cross-validation: use GCV but set gamma = 1.5. In this procedure the smoothness parameters are chosen to minimise the difference between the predicted value for an observation when it is included in the fit and when it is left out; i.e. the smoothness parameters are chosen so that the fit changes as little as possible when individual observations are added to the model.
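    In mgcv this is just the gamma argument to gam() (again assuming the question's x and y); each effective degree of freedom is made more expensive in the GCV score, which pushes the fit towards smoother solutions:

    library(mgcv)

    ## GCV selection with each effective degree of freedom counted 1.5 times,
    ## which typically yields a smoother fit than plain GCV (gamma = 1)
    mgcv.ps.dcv <- gam(y ~ s(x, bs = "ps"), method = "GCV.Cp", gamma = 1.5)
    summary(mgcv.ps.dcv)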