
Problem with GAM fitting: interpreting the basis dimension results from gam.check()


I'm using the R package 'mgcv' to create GAMs for predicting seabird distributions. I'm having some problems interpreting the results from gam.check() for some of my models (one model per species).

I'm using a negative binomial family with a log link, and an offset of the log of the sampled area. I have 13 possible predictor variables. My response data are discrete counts from surveys and are highly zero-inflated.

In each candidate model I check for concurvity (using 0.8 as a threshold across the whole model) and that my model terms are significant according to summary() (otherwise they are dropped), and the model converges each time. Despite this, I am still getting highly significant results from gam.check(), even though the edf is much lower than k'. I have tried increasing k a lot, but it makes no difference to the results, and often little difference to the edf. I have also tried the Tweedie and zero-inflated Poisson (ziP) families, but that doesn't make much of a difference either. I will say, however, that my QQ plots do not look fantastic.
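
For reference, the concurvity screen looks roughly like this (a minimal sketch, assuming the fitted model object is called gam_nb as below):

    # Full-model concurvity of each smooth with all other terms;
    # values above 0.8 in the "worst" row are what I treat as problematic.
    library(mgcv)
    conc <- concurvity(gam_nb, full = TRUE)
    conc["worst", ]
    any(conc["worst", ] > 0.8)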

[QQ plot of the model residuals]

Does anyone know why this might occur? I assume the model is unstable and should not be used, but I have no idea how to fix it. It's happening for almost all of the species too, so it's not specific to one dataset.

Here is my model

    gam_nb <- gam(COUNT_SUM ~ s(X_MEAN_2, k = 25) + s(Currents_Northward, k = 25),
                  data = my_data, family = "nb", select = TRUE,
                  offset = LOG_AREA_SUM, method = "REML")

And here are the diagnostic results from gam.check():

Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-0.002334867,0.0005403022]
(score 1783.03 & scale 1).
Hessian positive definite, eigenvalue range [1.846342e-06,121.4505].
Model rank =  49 / 49 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                        k'  edf k-index p-value    
s(X_MEAN_2)           24.0 11.4    0.66  <2e-16 ***
s(Currents_Northward) 24.0 11.8    0.69   0.025 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now see what happens when I increase k

    gam_nb <- gam(COUNT_SUM ~ s(X_MEAN_2, k = 100) + s(Currents_Northward, k = 100),
                  data = my_data, family = "nb", select = TRUE,
                  offset = LOG_AREA_SUM, method = "REML")


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-6.715822e-05,1.121576e-05]
(score 1783.004 & scale 1).
Hessian positive definite, eigenvalue range [7.32924e-06,120.848].
Model rank =  199 / 199 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                        k'  edf k-index p-value    
s(X_MEAN_2)           99.0 12.0    0.67  <2e-16 ***
s(Currents_Northward) 99.0 13.1    0.68  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Solution

  • You shouldn't be doing model selection by retaining terms on the basis of the Wald-like significance tests of the terms/smooths in the summary() output; you are destined to bias your estimated effects low (an effect of 0 is assumed when a term is excluded) or high (extreme effects/functions are more likely to be significant, but also potentially spuriously so).

    You're using select = TRUE, so just include the set of covariates that you think are important controls on the response, fit that single model, and then interpret it.
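
    A minimal sketch of that single-model approach (the covariate names beyond the two in the question are hypothetical stand-ins for the remaining predictors):

        # One model containing all candidate covariates; select = TRUE adds an
        # extra penalty that can shrink unneeded smooths towards zero, so no
        # dropping of terms based on summary() p-values is needed.
        # SST_MEAN and DEPTH are hypothetical names for the other predictors.
        gam_full <- gam(COUNT_SUM ~ s(X_MEAN_2, k = 25) + s(Currents_Northward, k = 25) +
                          s(SST_MEAN, k = 25) + s(DEPTH, k = 25),
                        data = my_data, family = "nb", select = TRUE,
                        offset = LOG_AREA_SUM, method = "REML")
        summary(gam_full)  # interpret this single model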

    The output from gam.check() is a heuristic and just indicates that there is some problem with the model. Ostensibly you are running this to check the basis dimension, but when the EDF of a term is low compared to the maximum (k') yet the p-value remains low, this can indicate other issues with the model. It's possible you are missing covariates (those things you excluded on the basis of significance could well be controlling for variation in the response that is now showing up as unmodelled dependence).

    To check for this, model the residuals with all your covariates (either singly or in combination) and plot the residuals against the covariates. You may also be missing some other covariate you didn't think of or measure. There you have fewer options available, as there's not much you can do unless you can introduce a surrogate term for this unmodelled "effect".
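
    A sketch of that residual check (assuming no rows of my_data were dropped during fitting; SST_MEAN again stands in for a dropped covariate):

        # Extract deviance residuals from the fitted model and look for
        # remaining structure against a covariate that was dropped.
        my_data$resid_dev <- residuals(gam_nb, type = "deviance")
        plot(my_data$SST_MEAN, my_data$resid_dev)  # visual check

        # Or fit a quick Gaussian GAM to the residuals; a clearly non-flat
        # smooth suggests unmodelled variation that covariate could explain.
        resid_fit <- gam(resid_dev ~ s(SST_MEAN), data = my_data, method = "REML")
        summary(resid_fit)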

    Also, consider if there is unmodelled spatial or temporal structure, or clustering in the data (repeated observations from the same locations). All of these can cause the heuristic test in gam.check() to fail.
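
    One way to probe that last point is to add terms for space and for repeated visits to the same locations; a rough sketch, where LON, LAT and SITE_ID are hypothetical column names (SITE_ID must be a factor):

        # Spatial smooth plus a random effect for location, alongside the
        # original covariates.
        gam_sp <- gam(COUNT_SUM ~ s(X_MEAN_2, k = 25) + s(Currents_Northward, k = 25) +
                        s(LON, LAT) +            # smooth of space
                        s(SITE_ID, bs = "re"),   # random effect for repeated locations
                      data = my_data, family = "nb", select = TRUE,
                      offset = LOG_AREA_SUM, method = "REML")
        gam.check(gam_sp)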