I have a small (n = 28) dataset of occurrence counts for three seabirds and have fitted a hurdle GAM (using mgcv::gam()): first a binary presence/absence model, then a negative binomial model on the presence-only records. The presence-only subsets have sample sizes of 12, 12, and 22 for the three seabirds. The seabird data are also overdispersed, with many zeros and often low (< 10) counts at presence points. These are the models for each seabird:
# three seabirds: prion, storm petrel, sooty shearwater
prion_binary <- mgcv::gam(prion_binary ~ s(avg_SST) +
                            s(avg_SSS) +
                            s(delta_SST) +
                            s(delta_SSS) +
                            s(distance, k = 8) +        # 9 different distances
                            s(total_zp) +               # total zooplankton
                            s(trip_factor, bs = "re"),
                          method = "ML",
                          family = binomial(link = "logit"),
                          data = seabird)
prion_count <- mgcv::gam(prion ~ s(avg_SST) +
                           s(avg_SSS) +
                           s(delta_SST) +
                           s(delta_SSS) +
                           s(distance, k = 5) +         # 6 different distances
                           s(total_zp) +                # total zooplankton
                           s(trip_factor, bs = "re"),
                         method = "ML",
                         family = mgcv::nb(),           # negative binomial, theta estimated
                         data = seabird[seabird$prion > 0, ])
My issue is that the model output shows very high deviance explained but almost no significant predictors. In one case the adjusted R-squared was also negative. I think I likely have too many predictors, but when I run univariate models each predictor explains some deviance and has p < 0.05, so I'm not sure which to remove. The residual plots also don't align with such high deviance explained.
I'm not sure where to go next, so any help would be appreciated.
This is the output from the three seabird models:
prion_binary
Family: binomial
Link function: logit
Formula:
prion_binary ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 8) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.557 4.021 -0.636 0.525
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000 1.000 0.687 0.407
s(avg_SSS) 1.000 1.000 0.324 0.569
s(delta_SST) 1.000 1.000 0.282 0.596
s(delta_SSS) 1.000 1.000 0.440 0.507
s(distance) 1.000 1.000 0.963 0.326
s(total_zp) 1.742 2.051 0.782 0.736
s(trip_factor) 1.349 3.000 3.615 0.120
R-sq.(adj) = 0.995 Deviance explained = 97.5%
-ML = 6.8535 Scale est. = 1 n = 28
prion count:
Family: Negative Binomial(2277108.965)
Link function: log
Formula:
prion ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 5) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8218 0.1979 4.153 3.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1 0.017 0.896
s(avg_SSS) 1.000e+00 1 0.675 0.411
s(delta_SST) 1.000e+00 1 0.085 0.771
s(delta_SSS) 1.000e+00 1 0.148 0.700
s(distance) 1.000e+00 1 0.727 0.394
s(total_zp) 1.000e+00 1 0.059 0.809
s(trip_factor) 1.016e-07 2 0.000 0.508
R-sq.(adj) = -0.177 Deviance explained = 55.1%
-ML = 17.514 Scale est. = 1 n = 12
sooty shearwater binary
Family: binomial
Link function: logit
Formula:
shearwater_binary ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) +
s(delta_SSS, k = 15) + s(distance, k = 8) + s(total_zp) +
s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.611 2.656 1.36 0.174
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000 1 0.917 0.33829
s(avg_SSS) 1.000 1 0.914 0.33915
s(delta_SST) 1.000 1 0.000 0.99210
s(delta_SSS) 1.000 1 0.017 0.89504
s(distance) 1.000 1 0.004 0.94848
s(total_zp) 1.000 1 0.113 0.73652
s(trip_factor) 1.141 3 11.683 0.00514 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.472 Deviance explained = 59.4%
-ML = 9.4597 Scale est. = 1 n = 28
shearwater count
Family: Negative Binomial(6642419.022)
Link function: log
Formula:
shearwater ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 8) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9002 0.1211 15.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1.000 1.841 0.1748
s(avg_SSS) 3.606e+00 4.272 125.264 < 2e-16 ***
s(delta_SST) 1.000e+00 1.000 5.619 0.0178 *
s(delta_SSS) 1.000e+00 1.000 18.094 2.11e-05 ***
s(distance) 4.657e+00 5.328 277.393 < 2e-16 ***
s(total_zp) 1.000e+00 1.000 10.490 0.0012 **
s(trip_factor) 9.002e-07 3.000 0.000 0.4375
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.999 Deviance explained = 99.5%
-ML = 67.138 Scale est. = 1 n = 22
storm petrel binary
Family: binomial
Link function: logit
Formula:
storm_petrel_binary ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) +
s(delta_SSS) + s(distance, k = 8) + s(total_zp) + s(trip_factor,
bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7884 0.9928 -0.794 0.427
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1.00 0.174 0.676
s(avg_SSS) 1.000e+00 1.00 0.190 0.663
s(delta_SST) 1.000e+00 1.00 1.038 0.308
s(delta_SSS) 1.000e+00 1.00 0.000 0.996
s(distance) 3.003e+00 3.69 5.302 0.213
s(total_zp) 1.000e+00 1.00 0.039 0.844
s(trip_factor) 5.115e-07 3.00 0.000 0.369
R-sq.(adj) = 0.595 Deviance explained = 64.9%
-ML = 12.629 Scale est. = 1 n = 28
storm petrel count
Family: Negative Binomial(1572380.699)
Link function: log
Formula:
storm_petrel ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 5) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8340 0.2065 4.039 5.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1 2.654 0.1033
s(avg_SSS) 1.000e+00 1 0.861 0.3535
s(delta_SST) 1.000e+00 1 1.389 0.2386
s(delta_SSS) 1.000e+00 1 4.626 0.0315 *
s(distance) 1.000e+00 1 0.562 0.4534
s(total_zp) 1.000e+00 1 0.580 0.4463
s(trip_factor) 1.018e-07 2 0.000 0.2196
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.647 Deviance explained = 73.8%
-ML = 19.901 Scale est. = 1 n = 12
My guess (I can't know without seeing the data) is that your explanatory variables are highly correlated with each other. The significance of each term is assessed conditional on the other terms in the model, that is, by how much it explains beyond what a model containing all the other variables already explains. So if your explanatory variables are collinear, adding any one of them won't explain much that the others haven't already. That would also account for every predictor being significant in a univariate model while almost none are significant in the full model.
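One way to check for this is to look at the pairwise correlations among the predictors and at the concurvity of the fitted smooths (the GAM analogue of collinearity). Here is a minimal sketch on simulated stand-in data, since I don't have your `seabird` data frame. The column names are borrowed from your question, but the simulated values, the response `present`, the `k = 4` basis size, and the 0.7 cut-off are all assumptions for illustration.

```r
library(mgcv)

set.seed(1)
n <- 28

# Simulated stand-in for the real data: avg_SSS is built to track avg_SST closely
avg_SST  <- rnorm(n, mean = 12, sd = 1.5)
avg_SSS  <- 34 + 0.3 * (avg_SST - 12) + rnorm(n, sd = 0.1)
total_zp <- rlnorm(n, meanlog = 3, sdlog = 0.5)
sim <- data.frame(
  avg_SST, avg_SSS, total_zp,
  present = rbinom(n, 1, plogis(0.8 * (avg_SST - 12)))  # hypothetical response
)

# Pairwise correlations among the predictors
cor_mat <- cor(sim[, c("avg_SST", "avg_SSS", "total_zp")])
print(round(cor_mat, 2))

# Flag predictor pairs with |r| above an (arbitrary) 0.7 cut-off
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
print(high)

# Concurvity of the smooths in a fitted GAM: values near 1 mean a smooth
# can be almost entirely reproduced by the other terms in the model
fit <- gam(present ~ s(avg_SST, k = 4) + s(avg_SSS, k = 4) + s(total_zp, k = 4),
           family = binomial, method = "REML", data = sim)
conc <- concurvity(fit, full = TRUE)
print(round(conc, 2))
```

On your real data you would pass your own predictor columns to `cor()` and your own fitted models to `concurvity()`.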
You also definitely have too many predictors for the data you have. That could quite possibly be the sole reason your explained deviance is so high. With only 12 observations, you probably don't want more than one or two predictors (though read elsewhere for other opinions).
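To give a sense of scale, here is a sketch of a count model with two smooths and a deliberately small basis, fitted to 12 simulated observations. The data, the choice of `avg_SST` and `distance` as the two predictors, and `k = 3` are assumptions for illustration, not a recommendation for which of your variables to keep.

```r
library(mgcv)

set.seed(2)
n <- 12  # same size as the smallest presence-only subset in the question

# Simulated stand-in data (hypothetical values)
sim <- data.frame(
  avg_SST  = rnorm(n, mean = 12, sd = 1.5),
  distance = runif(n, min = 1, max = 50)
)
# Simulated overdispersed counts, shifted so that every count is positive
sim$count <- 1 + rnbinom(n, mu = exp(0.5 + 0.2 * (sim$avg_SST - 12)), size = 2)

# Two smooths with k = 3 each: 1 intercept + 2 + 2 = 5 coefficients in total
small_fit <- gam(count ~ s(avg_SST, k = 3) + s(distance, k = 3),
                 family = nb(), method = "REML", data = sim)

print(summary(small_fit))
n_coef <- length(coef(small_fit))
print(n_coef)
```

Even this small model uses up to 5 coefficients for 12 observations, which is why adding five more smooths on top of it leaves the fit with very little information per parameter.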
One possible way forward would be to do a principal component analysis of your explanatory variables, or of a subset of your explanatory variables that would naturally group together. If one or two principal components explain a large proportion of the variance in your explanatory variables, then use those principal components as your predictors instead.
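The PCA route can be sketched as follows, again on simulated stand-in data. The four hydrographic column names come from your question; the simulated values, the shared `gradient` driving them, the response `present`, and `k = 4` are assumptions for illustration.

```r
library(mgcv)

set.seed(3)
n <- 28

# Simulated stand-in: four hydrographic variables driven by one shared gradient
gradient <- rnorm(n)
env <- data.frame(
  avg_SST   = 12 + 1.5 * gradient + rnorm(n, sd = 0.3),
  avg_SSS   = 34 + 0.4 * gradient + rnorm(n, sd = 0.1),
  delta_SST = 0.5 * gradient + rnorm(n, sd = 0.2),
  delta_SSS = -0.2 * gradient + rnorm(n, sd = 0.05)
)
present <- rbinom(n, 1, plogis(gradient))  # hypothetical presence/absence

# PCA on centred and scaled predictors (they are on different scales)
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Proportion of predictor variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 2))

# Loadings, for interpreting what each component represents
print(round(pca$rotation, 2))

# Use the first component's scores as a single predictor
dat <- data.frame(present = present, PC1 = pca$x[, 1])
pc_fit <- gam(present ~ s(PC1, k = 4),
              family = binomial, method = "REML", data = dat)
print(summary(pc_fit))
```

The trade-off is interpretability: a component is a weighted mix of the original variables, so check the loadings before describing an effect of `PC1` in ecological terms.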
Another possibility would be to jettison any predictors that seem less important a priori (emphasis on the a priori part).
Also, you will probably get better answers than this on Stats.SE.