mgcv bam() error: cannot allocate vector of size 99.6 Gb


I am trying to fit an additive mixed model using bam (mgcv package). My dataset has 10^6 observations from a longitudinal study on growth in 2*10^5 children nested in 300 health centers. I am looking for the slope for each center. The model is:

bam(haz ~ s(month, bs = "cc", k = 12)+ sex+ s(age)+ center+ year+ year*center+s(child, bs="re"), data)

However, when I try to fit the model, the following error message appears:

Error: cannot allocate vector of size 99.6 Gb
In addition: Warning message:
In matrix(by, n, q) : data length exceeds size of matrix

I am working on a cluster with 500 GB of RAM.

Thank you for any help


Solution

  • To diagnose more precisely where the problem is, try fitting your model with various terms left out. There are several terms in the model that could blow up on you:

    • the fixed effects involving center will blow up to 300 columns * 10^6 rows; depending on whether year is numeric or a factor, the year*center term could blow up to 600 columns or (nyears*300) columns
    • it's not clear to me whether bam uses sparse matrices for s(.,bs="re") terms; if not, you'll be in big trouble (2*10^5 columns * 10^6 rows)
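The column counts above can be checked on a small stand-in before touching the full data. The sketch below is only an illustration: the `toy` data frame, its 30 centers and its 10 years are made-up assumptions, not the asker's data. It uses base R's `model.matrix()` to show how `year * center` expands when year is numeric versus a factor.

```r
# Toy stand-in for the real data (hypothetical sizes: 30 centers, 10 years)
set.seed(1)
n <- 2000
toy <- data.frame(
  center = factor(sample(1:30, n, replace = TRUE), levels = 1:30),
  year   = sample(2001:2010, n, replace = TRUE)
)

# year numeric: intercept + year + 29 center dummies + 29 year:center slopes
X_num <- model.matrix(~ year * center, data = toy)
ncol(X_num)

# year as a factor: every year-by-center combination gets its own column
toy$year_f <- factor(toy$year, levels = 2001:2010)
X_fac <- model.matrix(~ year_f * center, data = toy)
ncol(X_fac)
```

With 300 centers instead of 30, the same pattern gives the 600 and nyears*300 columns mentioned above.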

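    As an order-of-magnitude check, a vector of 10^6 numeric values (one column of your model matrix) takes about 7.6 MB, so 500 GB / 7.6 MB is approximately 65,000 columns. A dense block of 2*10^5 columns for the child random effect would be roughly three times larger than that.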
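The arithmetic behind that estimate can be reproduced in a few lines of R. This is a rough sketch that assumes 8 bytes per double, dense storage, and binary units (1024-based), and it ignores the extra copies R makes during fitting, so the real limit is lower.

```r
n_rows <- 1e6                      # observations
bytes_per_col <- 8 * n_rows        # one dense column of doubles
mib_per_col <- bytes_per_col / 1024^2
mib_per_col                        # about 7.6

# How many dense columns fit in 500 GB of RAM (treated here as 500 * 1024 MiB)
ram_gib <- 500
max_cols <- ram_gib * 1024 / mib_per_col
max_cols                           # same order of magnitude as the ~65,000 quoted above

# Dense storage for one column per child (2e5 children), in GiB
child_re_gib <- 2e5 * bytes_per_col / 1024^3
child_re_gib

# Number of 10^6-row columns that a 99.6 Gb allocation corresponds to,
# assuming R's "Gb" means 1024^3 bytes
cols_in_error <- 99.6 * 1024^3 / bytes_per_col
cols_in_error
```

The dense child block alone comes out well above the 500 GB available, which is why the storage format of the `s(child, bs = "re")` term matters so much.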

    Just taking a guess here, but I would try out the gamm4 package. It's not specifically geared for low-memory use, but its documentation says:

    ‘gamm4’ is most useful when the random effects are not i.i.d., or when there are large numbers of random coeffecients [sic] (more than several hundred), each applying to only a small proportion of the response data.

    I would also make most of the terms into random effects:

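    # gamm4 takes random effects in lme4 syntax via the `random` argument
    gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
                 random = ~ (1 | center) + (1 | year) + (1 | year:center) + (1 | child),
                 data = data)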
    

    or, if there are not very many years in the data set, treat year as a fixed effect:

    gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + year,
                 random = ~ (1 | center) + (1 | year:center) + (1 | child),
                 data = data)
    

    If there are only a few years, then (year|center) might make sense, to assess among-center variation and covariation among years; if there are many years, consider making it a smooth term instead.