
gam() in R: Is it a spline model with automated knots selection?


I am running an analysis in which I need to plot a nonlinear relationship between two variables. I read about spline regression, where one challenge is choosing the number and position of the knots. So I was happy to read in this book that generalized additive models (GAM) fit "spline models with automated selection of knots". I therefore started reading about how to do a GAM analysis in R, and I was surprised to see that the gam() function has a knots argument.

Now I am confused. Does the gam() function in R fit a GAM that automatically finds the best knots? If so, why should we provide the knots argument? Also, the documentation says "If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers. For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value". This does not sound like a very elaborate algorithm for knot selection.

I could not find another source validating the statement that GAM is a spline model with automated knots selection. So is the gam() function the same as pspline() where degree is 3 (cubic) with the difference that gam() sets some default for the df argument?


Solution

  • The term GAM covers a broad church of models and approaches to solve the smoothness selection problem.

    mgcv uses penalized regression spline bases, with a wiggliness penalty to choose the complexity of the fitted smooth(s). As such, it doesn't choose the number of knots as part of the smoothness selection.

    Basically, you as the user choose how large a basis to use for each smooth function (by setting the argument k in the s(), te(), etc. functions used in the model formula). The value of k sets the upper limit on the wiggliness of each smooth function. The penalty measures the wiggliness of the function (it is typically the squared second derivative of the smooth, integrated over the range of the covariate). The model then estimates the coefficients of the basis functions representing each smooth by maximizing the penalized log-likelihood, which is the log-likelihood minus a wiggliness penalty for each smooth, with each penalty weighted by its smoothness parameter. The smoothness parameters themselves are chosen by a criterion such as GCV or REML.
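
    As a minimal sketch of what k does in practice, the following fits the same simulated data with two basis sizes. The data, the seed, and the choice of method = "REML" are assumptions made for illustration; they are not part of the question. The effective degrees of freedom (EDF) of the smooth should stay below k - 1 in both fits and should not grow much when k is tripled.

```r
library(mgcv)  # recommended package, normally installed alongside R

set.seed(1)                                   # simulated data, illustration only
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Same data, two basis sizes; bs = "cr" is the cubic regression spline basis
m10 <- gam(y ~ s(x, bs = "cr", k = 10), method = "REML")
m30 <- gam(y ~ s(x, bs = "cr", k = 30), method = "REML")

# EDF of the smooth: m$edf has one entry per coefficient, the first being the intercept
edf10 <- sum(m10$edf[-1])
edf30 <- sum(m30$edf[-1])
print(c(k10 = edf10, k30 = edf30))
```

    The upper limit is k - 1 rather than k because one degree of freedom is lost to the identifiability constraint on the smooth.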

    Put differently, you set the upper limit of expected complexity (wiggliness) for each smooth, and when the model is fitted the penalties shrink the coefficients behind each smooth so that excess wiggliness is removed from the fit.

    In this way, the smoothness parameters control how much shrinkage happens and hence how complex (wiggly) each fitted smooth is.
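
    To see the effect of the smoothness parameter in isolation, one can fix it by hand through the sp argument of gam() instead of letting it be estimated. This is only a sketch on simulated data; the two sp values are arbitrary extremes chosen for illustration.

```r
library(mgcv)

set.seed(2)                                   # simulated data, illustration only
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Same basis (k = 20) in both fits; only the fixed smoothing parameter differs
m_wiggly <- gam(y ~ s(x, k = 20), sp = 1e-6)  # almost no penalty
m_stiff  <- gam(y ~ s(x, k = 20), sp = 1e6)   # very heavy penalty

# EDF of the smooth (dropping the intercept entry)
edf_wiggly <- sum(m_wiggly$edf[-1])
edf_stiff  <- sum(m_stiff$edf[-1])
print(c(wiggly = edf_wiggly, stiff = edf_stiff))
```

    The heavily penalized fit should use far fewer effective degrees of freedom than the lightly penalized one, even though both smooths are built on exactly the same basis.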

    This approach avoids the problems of choosing where to put the knots.

    This doesn't mean the bases used to represent the smooths don't have knots. In the cubic regression spline basis you mention, the value you give to k sets the dimensionality of the basis, which implies a certain number of knots. These knots are placed at the boundaries of the covariate involved in the smooth and then spread evenly through the covariate values in between, unless the user supplies a different set of knot locations. However, once the number of knots and their locations are set, thus forming the basis, they are fixed. The wiggliness of the smooth is then controlled by the wiggliness penalty, not by varying the number of knots.
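
    The knot placement can be inspected directly. The sketch below reuses the setting from the documentation passage quoted in the question (101 data points, an 11-knot spline) on simulated data, then refits with a user-supplied set of knots. It assumes the fitted "cr" smooth stores its knot locations in the xp component; my_knots is a made-up set of locations for illustration.

```r
library(mgcv)

set.seed(3)                                   # simulated data, illustration only
x <- seq(0, 1, length.out = 101)
y <- sin(2 * pi * x) + rnorm(101, sd = 0.3)

# Default placement: no knots argument supplied
m_default <- gam(y ~ s(x, bs = "cr", k = 11))
knots_default <- as.numeric(m_default$smooth[[1]]$xp)
print(knots_default)

# User-supplied placement: for bs = "cr" the number of knots must equal k
my_knots <- c(0, 0.05, 0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9, 0.95, 1)
m_custom <- gam(y ~ s(x, bs = "cr", k = 11), knots = list(x = my_knots))
knots_custom <- as.numeric(m_custom$smooth[[1]]$xp)
print(knots_custom)
```

    In both fits the knots are fixed before estimation starts; only the coefficients and the smoothness parameter are estimated from the data.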

    You also have to be careful in R, as there are two packages providing a gam() function. The original gam package provides an R version of the software and approach described in the original GAM book by Hastie and Tibshirani. That package doesn't fit GAMs using penalized regression splines as I describe above.

    R ships with the mgcv package, which fits GAMs using penalized regression splines as I outline above. You control the size (dimensionality) of the basis for each smooth using the argument k. There is no argument df.

    Like I said, GAMs are a broad church and there are many ways to fit them. It is important to know what software you are using and what methods that software employs to estimate the GAM. Once you have that information in hand, you can home in on specific material for that particular approach to estimating GAMs. In this case, you should look at Simon Wood's book Generalized Additive Models: An Introduction with R, as it describes the mgcv package and is written by the package's author.