Tags: r, mixed-models, gam

Model selection with beta and quasi families using gamm4


I have two response variables: one conforms to a beta (also known as betar) distribution and the other is Poisson count data. I am looking into fitting additive mixed models with the beta family and a quasi family, respectively (the count data are over-dispersed).

I am aware that I could use the gamm function from the mgcv package, which accepts both the beta and quasi families. However, gamm uses PQL, so the reported AIC is not useful for comparing models - which is the primary objective of my analyses.

In the case of the count response, I am aware that QAIC has been used for ranking/comparing overdispersed mixed models, but I cannot find anything that says it is appropriate for overdispersed GAMMs.

I understand these are potentially two questions in one, but they share the common theme of model selection with extended families and potentially have different solutions. Below I provide reproducible examples for each case.

##generate data
library(gamm4)
library(mgcv)
dat <- gamSim(1,n=400,scale=2)
dat<-subset(dat, select=c(x0,x1,x2,x3,f) )
dat$g <- as.factor(sample(1:20,400,replace=TRUE)) # random factor
dat$yb <- runif(400) # yb ranges between 0 and 1, hence fitted with the beta family
dat$f <- dat$f + model.matrix(~ g-1)%*%rnorm(20)*2
dat$yp <- rpois(400,exp(dat$f/7)) # yp is counts, hence the Poisson family

# beta family example with the gamm function (this works - however I am not sure if the subsequent model comparisons are valid!)
m1b<-gamm(yb~s(x0)+s(x1)+s(x2)+s(x3),family=betar(link='logit'),data=dat,random=list(g=~1))
m2b<-gamm(yb~s(x1)+s(x2)+s(x3),family=betar(link='logit'),data=dat,random=list(g=~1))
m3b<-gamm(yb~s(x0)+s(x2)+s(x3),family=betar(link='logit'),data=dat,random=list(g=~1))

#AIC to compare models  
AIC(m1b,m2b,m3b)

#try the same using gamm4 (ideally) - it fails because gamm4 does not accept the beta family.
m<-gamm4(yb~s(x0)+s(x1)+s(x2)+s(x3),family=betar(link='logit'),data=dat,random = ~ (1|g)) 

##Example with a quasi family - the yp response is overdispersed count data (it may not actually be overdispersed in this simulated example)
#example using gamm function
m1p<-gamm(yp~s(x0)+s(x1)+s(x2)+s(x3),family = quasipoisson,data=dat,random=list(g=~1))
m2p<-gamm(yp~s(x1)+s(x2)+s(x3),family = quasipoisson,data=dat,random=list(g=~1))
m3p<-gamm(yp~s(x0)+s(x2)+s(x3),family = quasipoisson,data=dat,random=list(g=~1))

#AIC to compare models
AIC(m1p,m2p,m3p)

#again, the gamm4 version will not work as gamm4 does not accept quasi families
m<-gamm4(yp~s(x0)+s(x1)+s(x2)+s(x3),family = quasipoisson,data=dat,random = ~ (1|g))

Solution

  • You have a bunch of questions here, but I'll try to tackle them. Basically, you want to fit parametric statistical models with

    • random effects (nlme, lme4)
    • distributions from the exponential family ... (MASS::glmmPQL, lme4::glmer)
    • ... with overdispersion ...
    • ... or distributions beyond the exponential family such as the Beta distribution (VGAM, betareg)
    • additive models/splines (splines) ...
    • ... or penalized regression splines, which automatically adjust the complexity of the smooth terms
    • ... with a real likelihood model rather than a marginal or quasi-likelihood model (e.g. GEEs, PQL), so you can do classic inference

    Each of the specified issues above adds 1 or more "difficulty points" to a model-fitting exercise ... usually once your score goes beyond about +3 or so, you have to find a way to compromise or take shortcuts on some of the things you want. You've correctly identified gamm and gamm4 as doing some of the stuff you want, but you can't get everything. Some suggestions:

    Overdispersion

    One way to handle overdispersion is with an observation-level random effect, e.g.

    dat$obs <- factor(seq(nrow(dat)))
    m <- gamm4(yp~s(x0)+s(x1)+s(x2)+s(x3),
               family = poisson,data=dat,random = ~ (1|g)+(1|obs))
    
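    Because the observation-level random effect absorbs the extra-Poisson variation within a genuine likelihood, you could then (cautiously, with the same "use at your own risk" caveat as below) compare the lme4 components of nested fits with ordinary AIC. A minimal sketch of that idea, which is my addition rather than part of the original answer - m_red is a hypothetical refit of a reduced model with the same random-effect structure:

    ## sketch only: refit a reduced model with the same observation-level random effect
    m_red <- gamm4(yp~s(x1)+s(x2)+s(x3),
                   family = poisson,data=dat,random = ~ (1|g)+(1|obs))
    AIC(m$mer,m_red$mer)  ## compare the lme4 components of the two fits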

    Another option is to adjust for the overdispersion yourself, if you think that makes sense, e.g.:

    m0 <- gamm4(yp~s(x0)+s(x1)+s(x2)+s(x3),family = poisson,data=dat,random = ~ (1|g))
    

    First, compute the overdispersion factor:

    (phi <- sum(residuals(m0$gam,type="pearson")^2)/df.residual(m0$gam))
    ## [1] 1.003436
    

    (if we repeat this exercise with m0$mer instead we get 0.9939696: the result is almost exactly equal to 1 because we generated data from a Poisson distribution in the first place ...)
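    For completeness, a sketch of the corresponding calculation on the lme4 component (residuals() and df.residual() both have merMod methods):

    ## same overdispersion estimate, computed from the lme4 component instead
    sum(residuals(m0$mer,type="pearson")^2)/df.residual(m0$mer)
    ## roughly 0.994, the value quoted above

    The QAIC itself is then: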

    (qaic <- -2*logLik(m0$mer)/phi + 2*lme4:::npar.merMod(m0$mer))
    

    N.B. I am guessing that it makes sense to construct the likelihoods, etc. from the individual components of a gamm4 fit in this way; use at your own risk
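    If you want to rank several candidate models this way, a small helper may be convenient. This is my own sketch, not part of the original answer: it takes the parameter count from attr(logLik(.), "df") (which should be essentially what npar.merMod returns), and it uses a single phi estimated from the most complex ("global") model, as is conventional for QAIC. The model names in the commented example are hypothetical gamm4 refits of the candidates above.

    ## sketch: QAIC for a gamm4 fit, based on its lme4 component -- use at your own risk
    qAIC <- function(fit, phi) {
        ll <- logLik(fit$mer)
        k  <- attr(ll, "df")  ## number of estimated parameters
        as.numeric(-2*ll/phi + 2*k)
    }
    ## e.g.: sapply(list(m1=m1p4, m2=m2p4, m3=m3p4), qAIC, phi=phi)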

    Alternative distributions

    The glmmADMB and glmmTMB packages (both off-CRAN but findable via Google ...) can both handle mixed Beta models. They can't do penalized regression splines, but you can use regular splines via splines::ns() or splines::bs() (but you do have to decide on the appropriate level of complexity -- maybe you can guess from preliminary gamm or mgcv fits ...)

    library(glmmADMB)
    library(splines)
    m3b <- glmmadmb(yb~ns(x0,2)+ns(x1,2)+ns(x2,5)+ns(x3,2)+(1|g),
           family="beta",link="logit",data=dat)
    
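    To choose the ns() degrees of freedom above, one rough heuristic (my addition, not from the original answer) is to read the effective degrees of freedom off a preliminary penalized fit and round them:

    ## sketch: use EDFs from a preliminary mgcv fit to pick the ns() basis dimensions
    pre <- gam(yb~s(x0)+s(x1)+s(x2)+s(x3),
               family=betar(link='logit'),data=dat)
    round(summary(pre)$edf)  ## approximate df to plug into ns(x, df=...)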

    The glmmTMB package can in principle do this:

    library(glmmTMB)
    m2b <- glmmTMB(yb~ns(x0,2)+ns(x1,2)+ns(x2,5)+ns(x3,2)+(1|g),
           family=list(family="beta",link="logit"),data=dat)
    

    but the package is in development and the current set of results don't make sense -- so I might hesitate to use it at this point.