Tags: r, parallel-processing, mixed-models, glmmtmb, glmm

R package glmmTMB - model family nbinom2 - Error in MakeADFunObject


I could use some quick assistance with fitting a negative binomial model in R using the "glmmTMB" package with the family set to nbinom2. I opted for glmmTMB because it allows specifying fixed and random effects in both the main and zero-inflated parts of the formula, and it also supports parallel computing.

library(glmmTMB)

nt <- parallel::detectCores() - 1
neg_bin <- glmmTMB(eq_main,               # eq_main has both fixed and random effects, and an offset term in the fixed effects
                   data = x,
                   ziformula = eq_zeros,  # eq_zeros has only fixed effects; no random effects
                   family = nbinom2,
                   REML = TRUE,
                   control = glmmTMBControl(parallel = nt))

However, I've hit a roadblock with the following error:

"Error in MakeADFunObject(data, parameters, reportenv, ADreport = ADreport, : Caught exception 'std::bad_alloc' in function 'MakeADFunObject'"

Can someone shed light on what this error means and suggest steps I could take to resolve it? I would like to think that it is not a memory issue, because I am using the most state-of-the-art machine I was able to get my hands on (a.k.a. a supercomputer). Thanks in advance for your help!

NOTE: I recognize the importance of sharing a reproducible example, but due to the extensive size of the dataset (comprising several hundred variables), I'm currently refraining from providing one. If the community deems it necessary, I am more than willing to share a reproducible example upon request.


Solution

  • This is a "running-out-of-memory" error: std::bad_alloc is the exception that C++ code throws when a memory allocation fails.

    • It would help a lot if you told us more about the dimensions of your problem: how many observations in total? How many variables, and in particular how many factor variables, with how many levels each? In particular, what is
    dim(model.matrix(lme4::nobars(eq_main), data = x))

    (and the equivalent for your zero-inflation model formula; see the first sketch after this list)?

    • "most state-of-the-art machine I was able to get my hands on" is actually not very descriptive; how much RAM is available? (If you are running this in a high-performance-computing (HPC) facility, how much memory have you requested for the job?)
    • Can you try running some examples with small subsets of your data (subsetting predictor variables, observations, or both), and see (1) how big a subset you can successfully run and (2) how the memory requirements and computing time scale with problem size? (The peakRAM package is useful for this: it will report the elapsed time, memory used, and peak memory usage. See the second sketch after this list.)
    • I would be surprised if parallelizing is affecting your memory usage (glmmTMB uses OpenMP, which is a shared-memory approach), but it couldn't hurt to try without parallelizing to see if it makes a difference.
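
    For the dimension check above, here is a minimal sketch, assuming x, eq_main, and eq_zeros are the objects from the question (and that eq_zeros is a fixed-effects-only formula, so nobars() is not needed for it):

    ## conditional-model fixed-effect design matrix (nobars() strips the random-effect terms)
    dim(model.matrix(lme4::nobars(eq_main), data = x))
    ## the equivalent for the zero-inflation formula
    dim(model.matrix(eq_zeros, data = x))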
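
    For the subset-scaling test, a sketch using the peakRAM package (again assuming the objects from the question; the subset fractions are arbitrary, and the parallel control is deliberately left out so its effect can be ruled out):

    library(glmmTMB)
    library(peakRAM)
    ## fit on increasing fractions of the rows and watch how elapsed time
    ## and peak RAM grow with problem size
    for (f in c(0.05, 0.1, 0.2)) {
        xs <- x[sample(nrow(x), round(f * nrow(x))), ]
        print(peakRAM(
            glmmTMB(eq_main, data = xs, ziformula = eq_zeros,
                    family = nbinom2, REML = TRUE)
        ))
    }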

    The only suggestion I can make off the top of my head (without further information) that might help is to try sparseX = TRUE in your glmmTMB() call: if you have a lot of factor variables with many levels, they will be expanded into many model-matrix columns containing mostly zeros, and using sparse matrices could reduce the memory footprint of your problem.
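
    A minimal sketch of that call, assuming the objects from the question (sparseX is documented as a named logical vector with possible elements "cond", "zi", and "disp", so the named form is used here):

    neg_bin_sparse <- glmmTMB(eq_main,
                              data = x,
                              ziformula = eq_zeros,
                              family = nbinom2,
                              REML = TRUE,
                              control = glmmTMBControl(parallel = nt),
                              ## use sparse fixed-effect model matrices for the
                              ## conditional and zero-inflation components
                              sparseX = c(cond = TRUE, zi = TRUE))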