Does anyone have an idea how to do stepwise regression with Tweedie in R?
I found the mgcv
package, which apparently treats the power parameter of Tweedie as yet another parameter to be estimated. This seems to improve on having to use tweedie.profile
to estimate the power outside the glm
, so it seems encouraging for using an automated stepwise function to do the regression. But I haven't been able to figure out if the package also offers a stepwise function. The package manual has this to say.
I got lost in the talk about smooths:
There is no step.gam in package mgcv.
To facilitate fully automatic model selection the package implements two smooth modification techniques which can be used to allow smooths to be shrunk to zero as part of smoothness selection.
I would appreciate your help. Thanks.
Your question is not specific to "Tweedie" family; it is a general mgcv
feature in model selection.
mgcv
does not use step.gam
for model selection. I think your confusion comes from another package gam
, which would use step.gam
to sequentially add/drop a term and reports AIC. When you go ?step.gam
in mgcv
, it refers you to ?gam.selection
. ?step.gam
is intentionally left there, in case people search it. But all the details are provided in ?gam.selection
.
There is no need to do step.gam
in mgcv
. Model estimation and model selection are integrated in mgcv
. For a penalized regression/smoothing spline, when smoothing parameter goes to infinity (very large), its second derivative is penalized to zero, leaving a simple linear term. For example, if we specify a model like:
y ~ s(x1, bs = 'cr') + s(x2, bs = 'cr')
while s(x2)
is a spurious model term and should not be included in the model, then mgcv:::gam/bam
will shrink s(x2)
to x2
after estimation, resulting a model like:
y ~ s(x1) + x2
This means, when you use plot.gam()
to inspect the estimated smooth function for each model term, s(x1)
is a curve, but s(x2)
is a straight line.
Now this is not entirely satisfying. For a complete, successful model selection, we want to drop x2
as well, i.e., shrink s(x2)
to 0, to get notationally a model:
y ~ s(x1)
But this is not difficult to achieve. We can use shrinkage smooth class bs = 'ts'
(shrinkage thin plate regression spline, as opposed to the ordinary tp
) or bs = cs'
(shrinkage cubic regression spline, as opposed to the ordinary 'cr'), and mgcv:::gam/bam
should be able to shrink s(x2)
to 0. The math behind this, is that mgcv
will modify the eigen values of linear term (i.e., the null space) from 0, to 0.1, a small but positive number, so that penalization takes effect on linear term. As a result, when you do plot.gam()
, you will see s(x2)
is a horizontal line at 0.
bs = 'cs'
or bs = 'ts'
are supposed to be put in function s()
; yet mgcv
also allows you to leave bs = 'cr'
or bs = 'tp'
untouched in s()
, but put select = TRUE
in gam()
or bam()
. The select = TRUE
is a more general treatment, as shrinkage smooths at the moment only have class cs
and ts
, while select = TRUE
work for all kind of smooth specification. They essentially do the same thing, by increasing 0 eigen values to 0.1.
The following example is taken from the example under ?gam.selection
. Note how select = TRUE
shrinks several terms to 0, giving an informative model selection.
library(mgcv)
set.seed(3);n<-200
dat <- gamSim(1,n=n,scale=.15,dist="poisson") ## simulate data
dat$x4 <- runif(n, 0, 1);dat$x5 <- runif(n, 0, 1) ## spurious
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5),data=dat,
family=poisson,select=TRUE,method="REML")
summary(b)
plot.gam(b,pages=1)
Note that, the p-values
in summary.gam()
also gives evidence for such selection:
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x0) 1.7655119 9 5.264 0.0397 *
s(x1) 1.9271039 9 65.356 <2e-16 ***
s(x2) 6.1351372 9 156.204 <2e-16 ***
s(x3) 0.0002618 9 0.000 0.4088
s(x4) 0.0002766 9 0.000 1.0000
s(x5) 0.1757146 9 0.195 0.2963
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.545 Deviance explained = 51.6%
-REML = 430.78 Scale est. = 1 n = 200