Search code examples
rregressionmixed-modelsgammgcv

mgcv_1.8-24: "fREML" or "REML" method of bam() gives wrong explained deviance


Fitting the same model with bam using methods "fREML" and "REML" gave me close results, but the deviance explained is rather different as returned by summary.gam.

With "fREML" the quantity is ~3.5% (not good) while with "REML" it is ~50% (not that bad). How can it be possible? Which one is correct?

Unfortunately, I cannot provide a simple reproducible example.

#######################################
## method = "fREML", discrete = TRUE ##
#######################################

Family: binomial 
Link function: logit 
Formula:
ObsOrRand ~ s(Var1, k = 3) + s(RandomVar, bs = "re")  
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|) 
(Intercept)  -5.0026     0.2199  -22.75   <2e-16  
Approximate significance of smooth terms:
                  edf Ref.df Chi.sq  p-value 
s(Var1)          1.00  1.001  17.54 2.82e-05 
s(RandomVar)     16.39 19.000 145.03  < 2e-16  
R-sq.(adj) =  0.00349   Deviance explained = 3.57%
fREML = 2.8927e+05  Scale est. = 1         n = 312515

########################################
## method = "fREML", discrete = FALSE ##
########################################

Family: binomial 
Link function: logit 
Formula:
ObsOrRand ~ s(Var1, k = 3) + s(RandomVar, bs = "re")  
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|) 
(Intercept)  -4.8941     0.2207  -22.18   <2e-16  
Approximate significance of smooth terms:
                  edf Ref.df Chi.sq  p-value 
s(Var1)          1.008  1.016  17.44 3.09e-05 
s(RandomVar)     16.390 19.000 144.86  < 2e-16  
R-sq.(adj) =  0.00349   Deviance explained = 3.57%
fREML = 3.1556e+05  Scale est. = 1         n = 312515

#####################################################
## method = "REML", discrete method not applicable ##
#####################################################

Family: binomial 
Link function: logit 
Formula:
ObsOrRand ~ s(Var1, k = 3) + s(RandomVar, bs = "re")  
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|) 
(Intercept)  -4.8928     0.2205  -22.19   <2e-16  
Approximate significance of smooth terms:
                  edf Ref.df Chi.sq  p-value 
s(Var1)          1.156  1.278  16.57 8.53e-05 
s(RandomVar)     16.379 19.000 142.60  < 2e-16  
R-sq.(adj) =  0.0035   Deviance explained = 50.8%
-REML = 3.1555e+05  Scale est. = 1         n = 312515

Solution

  • This issue can be back tracked to mgcv_1.8-23. Its changlog for reads:

    * bam extended family extension had introduced a bug in null deviance 
      computation for Gaussian additive case when using methods other than fREML 
      or GCV.Cp. Fixed.
    

    It now turns out that the patch is successful for Gaussian case, but not for non-Gaussian cases.


    Let me first provide a reproducible example, as your question does not have one.

    set.seed(0)
    x <- runif(1000)
    ## the linear predictor is a 3rd degree polynomial
    p <- binomial()$linkinv(0.5 + poly(x, 3) %*% rnorm(3) * 20)
    ## p is well spread out on (0, 1); check `hist(p)`
    y <- rbinom(1000, 1, p)
    
    library(mgcv)
    #Loading required package: nlme
    #This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
    
    fREML <- bam(y ~ s(x, bs = 'cr', k = 8), family = binomial(), method = "fREML")
    REML <- bam(y ~ s(x, bs = 'cr', k = 8), family = binomial(), method = "REML")
    GCV <- bam(y ~ s(x, bs = 'cr', k = 8), family = binomial(), method = "GCV.Cp")
    
    ## explained.deviance = (null.deviance - deviance) / null.deviance
    ## so in this example we get negative explained deviance for "REML" method
    
    unlist(REML[c("null.deviance", "deviance")])
    #null.deviance      deviance 
    #     181.7107     1107.5241 
    
    unlist(fREML[c("null.deviance", "deviance")])
    #null.deviance      deviance 
    #     1357.936      1107.524 
    
    unlist(GCV[c("null.deviance", "deviance")])
    #null.deviance      deviance 
    #     1357.936      1108.108 
    

    Null deviance can't be smaller than deviance (TSS can not be smaller than RSS), so "REML" method of bam fails to return the correct Null deviance here.

    I have located the problem at line 1350 of mgcv_1.8-24/R/bam.r:

    object$family <- object$fitted.values <- NULL
    

    Actually, it should be

    object$null.deviance <- object$fitted.values <- NULL
    

    For methods other than "GCV.Cp" and "fREML", bam relies on gam for estimation, after reducing the large n x p model matrix to a p x p matrix (n: number of data; p: number of coefficients). Since this new model matrix does not have a natural interpretation, many quantities returned by gam should be invalidated (apart from estimated smooth parameters). It was a typo for Simon to put family.

    I build a patched version and it turns out to fix the bug. I will tell Simon to get it fixed in the next release.