Search code examples
rgammgcv

GAM residuals in two distinct lines - R "mgcv"


I am trying to run GAMs using binomial data (link=logit) on r with the mgcv package. This is to attempt to describe habitat use of bottlenose dolphins using presence (1) and absence (0) data as the response and a suite of environmental variables as the predictor.

The code I am using appears to be working fine however, when I plot residuals I am left with two distinct lines. My understanding is that when plotting residuals there should be an even scatter around the line - however this is not the case - any guidance on what I should be looking for would be greatly appreciated

Here is the output using an example of 2 variables:

m1<-gam(Presence~s(Dist_Ent_k,k=8)+s(Dist_wall_m,k=5), data=mydata, 
        family = binomial(link = "logit"), weights=resp.weight)

summary(m1)

Family: binomial 
Link function: logit 

Formula:
Presence ~ s(Dist_Ent_k, k = 8) + s(Dist_wall_m, k = 5)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.30155    0.09839  -3.065  0.00218 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                 edf Ref.df Chi.sq p-value   
s(Dist_Ent_k)  2.658  3.333 16.411  0.0015 **
s(Dist_wall_m) 1.389  1.680  0.273  0.7434   

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0359   Deviance explained = 3.42%
UBRE = -0.76828  Scale est. = 1         n = 2696

plot(m1,shade=T,scale = 0,residuals = TRUE)]

Thank you in advance!


Solution

  • What you are plotting are partial residuals and that you see two distinct bands is simply the result of your data being binary or Bernoulli observations.

    You'll see this too if you plot the deviance residuals vs the linear predictor, just more extreme; try

    layout(matrix(1:4, ncol = 2, byrow = TRUE))
    gam.check(m1)
    layout(1)
    

    Model diagnostics for Bernoulli models (binomial with a single trial) are difficult because of the extreme nature of the data — the response is just a 0 or a 1. You can do diagnostics more easily for example if you aggregate the data in some way such that you no longer have m=1 trials but m=M; say if your data are spatially arranged you could create a larger grid over the region and aggregate the 0s and 1s for the points in each grid, retaining information on how many points were in each grid (to give the M for each aggregated binomial count).

    Otherwise I don;t think there is much to be gained from plotting partial or deviance residuals for such models. The QQ-plot in the set from gam.check(), especially if you add rep = 100 (or some such number) is more useful for checking distributional assumptions as it allows a reference band to be created which has good properties for models like this; see ?qq.gam for the function/info needed to create only the QQ plot.