I am trying to run GAMs using binomial data (link=logit) on r with the mgcv package. This is to attempt to describe habitat use of bottlenose dolphins using presence (1) and absence (0) data as the response and a suite of environmental variables as the predictor.
The code I am using appears to be working fine however, when I plot residuals I am left with two distinct lines. My understanding is that when plotting residuals there should be an even scatter around the line - however this is not the case - any guidance on what I should be looking for would be greatly appreciated
Here is the output using an example of 2 variables:
m1<-gam(Presence~s(Dist_Ent_k,k=8)+s(Dist_wall_m,k=5), data=mydata,
family = binomial(link = "logit"), weights=resp.weight)
summary(m1)
Family: binomial
Link function: logit
Formula:
Presence ~ s(Dist_Ent_k, k = 8) + s(Dist_wall_m, k = 5)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.30155 0.09839 -3.065 0.00218 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(Dist_Ent_k) 2.658 3.333 16.411 0.0015 **
s(Dist_wall_m) 1.389 1.680 0.273 0.7434
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0359 Deviance explained = 3.42%
UBRE = -0.76828 Scale est. = 1 n = 2696
plot(m1,shade=T,scale = 0,residuals = TRUE)]
Thank you in advance!
What you are plotting are partial residuals and that you see two distinct bands is simply the result of your data being binary or Bernoulli observations.
You'll see this too if you plot the deviance residuals vs the linear predictor, just more extreme; try
layout(matrix(1:4, ncol = 2, byrow = TRUE))
gam.check(m1)
layout(1)
Model diagnostics for Bernoulli models (binomial with a single trial) are difficult because of the extreme nature of the data — the response is just a 0 or a 1. You can do diagnostics more easily for example if you aggregate the data in some way such that you no longer have m=1
trials but m=M
; say if your data are spatially arranged you could create a larger grid over the region and aggregate the 0s and 1s for the points in each grid, retaining information on how many points were in each grid (to give the M
for each aggregated binomial count).
Otherwise I don;t think there is much to be gained from plotting partial or deviance residuals for such models. The QQ-plot in the set from gam.check()
, especially if you add rep = 100
(or some such number) is more useful for checking distributional assumptions as it allows a reference band to be created which has good properties for models like this; see ?qq.gam
for the function/info needed to create only the QQ plot.