I am working with count data (available here) that are zero-inflated and overdispersed and have random effects. The package best suited to this sort of data is glmmTMB (details here and troubleshooting here).
Before modelling, I inspected the data for normality (they are zero-inflated), homogeneity of variance, correlations, and outliers. The data had two outliers, which I removed from the dataset linked above. There are 351 observations from 18 locations (prop_id).
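A quick way to gauge zero inflation, for anyone reproducing this step, is to compare the observed proportion of zeros with what a Poisson distribution with the same mean would predict (the counts below are made up; the real data are linked above):

```r
# made-up counts standing in for euc0; compare observed zeros with the
# Poisson expectation P(Y = 0) = exp(-lambda)
y <- c(0, 0, 0, 0, 3, 0, 1, 0, 0, 2, 0, 5)
obs_zero <- mean(y == 0)   # observed proportion of zeros
exp_zero <- exp(-mean(y))  # zeros expected under Poisson(mean(y))
obs_zero > exp_zero        # TRUE here: more zeros than Poisson predicts
```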
The data looks like this:
euc0 ea_grass ep_grass np_grass np_other_grass month year precip season prop_id quad
3 5.7 0.0 16.7 4.0 7 2006 526 Winter Barlow 1
0 6.7 0.0 28.3 0.0 7 2006 525 Winter Barlow 2
0 2.3 0.0 3.3 0.0 7 2006 524 Winter Barlow 3
0 1.7 0.0 13.3 0.0 7 2006 845 Winter Blaber 4
0 5.7 0.0 45.0 0.0 7 2006 817 Winter Blaber 5
0 11.7 1.7 46.7 0.0 7 2006 607 Winter DClark 3
The response variable is euc0 and the random effects are prop_id and quad. The rest of the variables are fixed effects (all representing the percent cover of different plant species).
The model I want to run:
library(glmmTMB)
seed0 <- glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass +
                   month + year*precip + season*precip +
                   (1|prop_id) + (1|quad),
                 data = euc, family = poisson(link = "identity"))
fit_zinbinom <- update(seed0, family = nbinom2)  # allow the variance to increase quadratically with the mean
The error I get after running the seed0 code is:
Error in optimHess(par.fixed, obj$fn, obj$gr) :
  gradient in optim evaluated to length 1 not 15
In addition: There were 50 or more warnings (use warnings() to see the first 50)
warnings() gives:
1: In (function (start, objective, gradient = NULL, hessian = NULL, ... :
  NA/NaN function evaluation
I also normally mean-center and standardize my numerical variables, but this only removes the first error; the NA/NaN error persists. I tried adding a glmmTMBControl statement like this OP did, but it just opened a whole new world of errors.
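For reference, the mean-centering and standardizing step mentioned above can be sketched in base R (the column names and values below are made up to stand in for the real predictors):

```r
# toy numeric predictors (made-up values) standing in for the real columns
euc_toy <- data.frame(ea_grass = c(5.7, 6.7, 2.3, 1.7),
                      precip   = c(526, 525, 524, 845))
# scale() centers to mean 0 and rescales to sd 1
euc_sc <- as.data.frame(lapply(euc_toy, function(x) as.numeric(scale(x))))
round(colMeans(euc_sc), 10)  # 0 0 -- centered
sapply(euc_sc, sd)           # 1 1 -- unit variance
```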
How can I fix this? What am I doing wrong?
A detailed explanation would be appreciated so that I can learn how to troubleshoot this better myself in the future. Alternatively, I am open to an MCMCglmm solution, as that function can also handle this sort of data (despite taking longer to run).
An incomplete answer ...
That said, let me run through some of the things I tried and where I ended up.
GGally::ggpairs(euc, columns = 2:10) doesn't detect anything obviously terrible about the data (I did throw out the data point with euc0 == 78).
In order to try to make the identity-link model work, I added some code to glmmTMB. You should be able to install it via remotes::install_github("glmmTMB/glmmTMB/glmmTMB@clamp") (note that you will need compilers etc. installed to build it). This version takes negative predicted values and forces them to be non-negative, while adding a corresponding penalty to the negative log-likelihood.
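The clamping idea can be sketched in plain R. This is a conceptual illustration only, not glmmTMB's actual implementation; the function name and epsilon are made up:

```r
# conceptual sketch (NOT glmmTMB internals): floor negative predicted means at
# a small positive value and return a penalty proportional to the violation
clamp_mu <- function(mu, eps = 1e-6) {
  penalty <- sum(pmin(mu - eps, 0)^2)  # positive only when mu dips below eps
  list(mu = pmax(mu, eps), penalty = penalty)
}
res <- clamp_mu(c(-0.5, 0.2, 1.3))  # first element floored; penalty > 0
```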
Using the new version of glmmTMB I don't get an error, but I do get these warnings:
Warning messages: 1: In fitTMB(TMBStruc) : Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc) : Model convergence problem; false convergence (8). See vignette('troubleshooting')
The Hessian (second-derivative) matrix being non-positive-definite means there are some (still hard-to-troubleshoot) problems. heatmap(vcov(f2)$cond, Rowv = NA, Colv = NA) lets me look at the covariance matrix. (I also like corrplot::corrplot.mixed(cov2cor(vcov(f2)$cond), "ellipse", "number"), but that doesn't work when vcov(.)$cond is non-positive definite. In a pinch you can use sfsmisc::posdefify() to force it to be positive definite ...)
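If sfsmisc isn't handy, the Matrix package (which ships with R) has a similar tool, nearPD(). A toy illustration with a made-up non-positive-definite "correlation" matrix:

```r
# made-up correlation matrix that is not positive definite
V <- matrix(c(1.00, 0.99, 0.70,
              0.99, 1.00, 0.99,
              0.70, 0.99, 1.00), nrow = 3)
min(eigen(V)$values)  # negative eigenvalue -> not positive definite
# nearPD() finds the nearest positive (semi-)definite matrix
V2 <- as.matrix(Matrix::nearPD(V, corr = TRUE)$mat)
min(eigen(V2)$values)  # now non-negative
```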
Tried scaling:
eucsc <- dplyr::mutate_at(euc1,dplyr::vars(c(ea_grass:precip)), ~c(scale(.)))
This will help some; right now we're still doing a few silly things, like treating year as a numeric variable without centering it (so the 'intercept' of the model is at year 0 of the Gregorian calendar ...). But that still doesn't fix the problem.
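Centering year is a one-liner, and it changes where the intercept lives. A toy lm with made-up years and counts shows the effect (the real model would be a GLMM, but the centering logic is the same):

```r
# made-up years and counts; centering 'year' makes the intercept interpretable
set.seed(42)
year <- rep(c(2006, 2007), each = 10)
y    <- rpois(20, 2)
coef(lm(y ~ year))[["(Intercept)"]]  # extrapolated back to year 0 AD
fit_c <- lm(y ~ I(year - mean(year)))
coef(fit_c)[["(Intercept)"]]         # at the mean study year: equals mean(y)
```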
Looking more closely at the ggpairs plot, it looks like season and year are confounded: with(eucsc, table(season, year)) shows that observations occur in Spring and Winter in one year and in Autumn in the other year. season and month are also confounded: if we know the month, then we automatically know the season.
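The confounding pattern is easy to reproduce with a toy cross-tabulation (seasons and years made up to mirror the description):

```r
# made-up design mirroring the description: Spring/Winter occur in one year,
# Autumn only in the other
d <- data.frame(season = c("Winter", "Winter", "Spring", "Spring",
                           "Autumn", "Autumn"),
                year   = c(2006, 2006, 2006, 2006, 2007, 2007))
tab <- with(d, table(season, year))
tab
# the zero cells mean you can't separate a season effect from a year effect
```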
At this point I decided to give up on the identity link and see what happened. update(<previous_model>, family = poisson) (i.e. using a Poisson with a standard log link) worked! So did family = nbinom2, which was much better. I looked at the results and discovered that the CIs for the precip × season coefficients were crazy, so I dropped the interaction term (update(f2S_noyr_logNB, . ~ . - precip:season)), at which point the results looked sensible.
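update()'s formula arithmetic works the same on any model object; a toy lm with made-up data shows the interaction term being dropped by . ~ . - a:b:

```r
# toy model (made-up data) to show '. ~ . - a:b' removing an interaction term
d  <- data.frame(y = c(1, 0, 2, 4, 0, 3), a = rnorm(6), b = rnorm(6))
f1 <- lm(y ~ a * b, data = d)        # terms: a, b, a:b
f2 <- update(f1, . ~ . - a:b)        # refit without the interaction
attr(terms(f2), "term.labels")       # "a" "b" -- the a:b term is gone
```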
A few final notes:
- the log-link fits (family = nbinom2) are probably sufficient.
- check the residual diagnostics (library(DHARMa); plot(simulateResiduals(f2S_noyr_logNB2))). I would spend some time plotting residuals and predicted values against various combinations of predictors to see if you can localize the problem.

PS A quicker way to see that there's something wrong with the fixed effects (multicollinearity):
X <- model.matrix(~ ea_grass + ep_grass +
np_grass + np_other_grass + month +
year*precip + season*precip,
data=euc)
ncol(X) ## 13
Matrix::rankMatrix(X) ## 11
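The same rank check on a self-contained toy design (made-up month/season factors, where season is completely determined by month) shows how aliasing drops the rank below the column count:

```r
# made-up design: 'season' is completely determined by 'month', so one
# model-matrix column is aliased and the matrix is rank-deficient
d <- data.frame(month  = factor(c(1, 1, 7, 7)),
                season = factor(c("Summer", "Summer", "Winter", "Winter")))
X <- model.matrix(~ month + season, data = d)
ncol(X)                            # 3 columns
as.integer(Matrix::rankMatrix(X))  # rank 2 -- the season column is aliased
```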
lme4 has tests like this, and machinery for automatically dropping aliased columns, but they aren't implemented in glmmTMB at present.