Search code examples
rconvergencestandardizedglmmtmb

zero-inflated overdispersed count data glmmTMB error in R


I am working with count data (available here) that are zero-inflated and overdispersed and has random effects. The package best suited to work with this sort of data is the glmmTMB (details here and troubleshooting here).

Before working with the data, I inspected it for normality (it is zero-inflated), homogeneity of variance, correlations, and outliers. The data had two outliers, which I removed from the dataset linekd above. There are 351 observations from 18 locations (prop_id).

The data looks like this:

euc0 ea_grass ep_grass np_grass np_other_grass month year precip season   prop_id quad
 3      5.7      0.0     16.7            4.0     7 2006    526 Winter    Barlow    1
 0      6.7      0.0     28.3            0.0     7 2006    525 Winter    Barlow    2
 0      2.3      0.0      3.3            0.0     7 2006    524 Winter    Barlow    3
 0      1.7      0.0     13.3            0.0     7 2006    845 Winter    Blaber    4
 0      5.7      0.0     45.0            0.0     7 2006    817 Winter    Blaber    5
 0     11.7      1.7     46.7            0.0     7 2006    607 Winter    DClark    3

The response variable is euc0 and the random effects are prop_id and quad_id. The rest of the variables are fixed effects (all representing the percent cover of different plant species).

The model I want to run:

library(glmmTMB)
seed0<-glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + month + year*precip + season*precip + (1|prop_id)  + (1|quad), data = euc, family=poisson(link=identity))

fit_zinbinom <- update(seed0,family=nbinom2) #allow variance increases quadratically

The error I get after running the seed0 code is:

Error in optimHess(par.fixed, obj$fn, obj$gr) : gradient in optim evaluated to length 1 not 15 In addition: There were 50 or more warnings (use warnings() to see the first 50)

warnings() gives:

1. In (function (start, objective, gradient = NULL, hessian = NULL,  ... :
  NA/NaN function evaluation

I also normally mean center and standardize my numerical variables, but this only removes the first error and keeps the NA/NaN error. I tried adding a glmmTMBControl statement like this OP, but it just opened a whole new world of errors.

How can I fix this? What am I doing wrong?

A detailed explanation would be appreciated so that I can learn how to troubleshoot this better myself in the future. Alternatively, I am open to a MCMCglmm solution as that function can also deal with this sort of data (despite taking longer to run).


Solution

  • An incomplete answer ...

    • identity-link models for limited-domain response distributions (e.g. Gamma or Poisson, where negative values are impossible) are computationally problematic; in my opinion they're often conceptually problematic as well, although there are some reasonable arguments in their favor. Do you have a good reason to do this?
    • This is a pretty small data set for the model you're trying to fit: 13 fixed-effect predictors and 2 random-effect predictors. The rule of thumb would be that you want about 10-20 times that many observations: that seems to fit in OK with your 345 or so observations, but ... only 40 of your observations are non-zero! That means your 'effective' number of observations/amount of information will be much smaller (see Frank Harrell's Regression Modeling Strategies for more discussion of this point).

    That said, let me run through some of the things I tried and where I ended up.

    • GGally::ggpairs(euc, columns=2:10) doesn't detect anything obviously terrible about the data (I did throw out the data point with euc0==78)

    In order to try to make the identity-link model work I added some code in glmmTMB. You should be able to install via remotes::install_github("glmmTMB/glmmTMB/glmmTMB@clamp") (note you will need compilers etc. installed to install this). This version takes negative predicted values and forces them to be non-negative, while adding a corresponding penalty to the negative log-likelihood.

    Using the new version of glmmTMB I don't get an error, but I do get these warnings:

    Warning messages: 1: In fitTMB(TMBStruc) : Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
    2: In fitTMB(TMBStruc) : Model convergence problem; false convergence (8). See vignette('troubleshooting')

    The Hessian (second-derivative) matrix being non-positive-definite means there are some (still hard-to-troubleshoot) problems. heatmap(vcov(f2)$cond,Rowv=NA,Colv=NA) lets me look at the covariance matrix. (I also like corrplot::corrplot.mixed(cov2cor(vcov(f2)$cond),"ellipse","number"), but that doesn't work when vcov(.)$cond is non-positive definite. In a pinch you can use sfsmisc::posdefify() to force it to be positive definite ...)

    Tried scaling:

    eucsc <- dplyr::mutate_at(euc1,dplyr::vars(c(ea_grass:precip)), ~c(scale(.)))
    

    This will help some - right now we're still doing a few silly things like treating year as a numeric variable without centering it (so the 'intercept' of the model is at year 0 of the Gregorian calendar ...)

    But that still doesn't fix the problem.

    Looking more closely at the ggpairs plot, it looks like season and year are confounded: with(eucsc,table(season,year)) shows that observations occur in Spring and Winter in one year and Autumn in the other year. season and month are also confounded: if we know the month, then we automatically know the season.

    At this point I decided to give up on the identity link and see what happened. update(<previous_model>, family=poisson) (i.e. using a Poisson with a standard log link) worked! So did using family=nbinom2, which was much better.

    I looked at the results and discovered that the CIs for the precip X season coefficients were crazy, so dropped the interaction term (update(f2S_noyr_logNB, . ~ . - precip:season)) at which point the results look sensible.

    A few final notes:

    • the variance associated with quadrat is effectively zero
    • I don't think you necessarily need zero-inflation; low means and overdispersion (i.e. family=nbinom2) are probably sufficient.
    • the distribution of the residuals looks OK, but there still seems to be some model mis-fit (library(DHARMa); plot(simulateResiduals(f2S_noyr_logNB2))). I would spend some time plotting residuals and predicted values against various combinations of predictors to see if you can localize the problem.

    PS A quicker way to see that there's something wrong with the fixed effects (multicollinearity):

    X <- model.matrix(~ ea_grass + ep_grass +
                       np_grass + np_other_grass + month +
                       year*precip + season*precip,
                      data=euc)
    ncol(X)  ## 13
    Matrix::rankMatrix(X) ## 11
    

    lme4 has tests like this, and machinery for automatically dropping aliased columns, but they aren't implemented in glmmTMB at present.