I am trying to save multiple GLM objects in a list. One of the GLM objects is trained on a large dataset, but I reduce the size of the object by setting all the unnecessary components of the GLM object to NULL. The problem is that I run into RAM issues because R reserves much more RAM than the size of the object would suggest. Does anyone know why this problem occurs and how I can solve it? On top of that, saving the object results in a file that is larger than the object size.
Example:
> glm_full <- glm(formula = formule , data = dataset, family = binomial(), model = F, y = F)
> glm_full$data <- glm_full$model <- glm_full$residuals <- glm_full$fitted.values <- glm_full$effects <- glm_full$qr$qr <- glm_full$linear.predictors <- glm_full$weights <- glm_full$prior.weights <- glm_full$y <- NULL
> rm(list= ls()[!(ls() %in% c('glm_full'))])
> object.size(glm_full)
172040 bytes
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 944802 50.5 3677981 196.5 3862545 206.3
Vcells 83600126 637.9 503881514 3844.4 629722059 4804.4
> rm(glm_full)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 944208 50.5 2942384 157.2 3862545 206.3
Vcells 4474439 34.2 403105211 3075.5 629722059 4804.4
Here you can see that R reserves RAM for the GLM object, so saving multiple GLM objects in the environment leads to out-of-memory problems.
A rough explanation for this is that glm hides pointers to the environment, and to things from the environment, deep down inside of the glm object (and in numerous places).
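As a minimal illustration (a sketch, reusing the glm_full name from the question), you can inspect where some of those pointers live; if the model was fit inside a function, that whole frame, training data included, can get dragged along when the object is serialized or saved:

env <- attr(glm_full$terms, ".Environment")   # environment captured by the formula/terms
ls(envir = env)                               # objects living alongside the model, e.g. 'dataset'
length(serialize(glm_full, NULL))             # serialized size reflects what gets dragged along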
What do you need to be able to do with your glm? Even though you've nulled out a lot of the "fat" of the model, your object size will still grow linearly with your data size, and when you compound that by storing multiple glm objects, bumping up against RAM limitations is an obvious concern.

Here is a function that will allow you to slice away pretty much everything that is non-essential, and the best part is that the glm object size will remain constant regardless of how large your data gets.
stripGlmLR = function(cm) {
  # Drop the per-observation components and data copies that predict() does not need.
  cm$y = c()
  cm$model = c()
  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  # Drop the family closures (they capture environments too).
  cm$family$variance = c()
  cm$family$dev.resids = c()
  cm$family$aic = c()
  cm$family$validmu = c()
  cm$family$simulate = c()
  # Drop the environments captured by the terms and formula, which would otherwise
  # be saved (training data and all) along with the model.
  attr(cm$terms, ".Environment") = c()
  attr(cm$formula, ".Environment") = c()
  cm
}
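For instance (a sketch reusing the formula/data names from the question; glm_slim is just an illustrative name for the stripped copy):

glm_full <- glm(formule, data = dataset, family = binomial(), model = FALSE, y = FALSE)
glm_slim <- stripGlmLR(glm_full)
object.size(glm_slim)                               # stays roughly constant as 'dataset' grows
pred_link <- predict(glm_slim, newdata = dataset)   # predictions on the link scale still work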
Some notes:
You can null out model$family entirely and the predict function will still return its default value (so predict(model, newdata = data) will work). However, predict(model, newdata = data, type = 'response') will fail. You can recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). (Not sure about type = 'terms'.)
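In code, that recovery might look like this (a sketch; glm_slim is the stripped model from the earlier example, and plogis() is base R's logistic/sigmoid function):

eta <- predict(glm_slim, newdata = dataset)   # linear predictor (link scale)
p   <- plogis(eta)                            # same as 1 / (1 + exp(-eta)): response-scale probabilities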
Most importantly, any of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version (so summary(), anova(), and step() are all a no-go). Thus, you'd be wise to extract all of this info from your glm object before running the stripGlmLR function.
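For example (a sketch; glm_info is just an illustrative name), you could stash the pieces you still need before stripping:

glm_info <- list(
  coef_table = coef(summary(glm_full)),   # estimates, std. errors, z-values, p-values
  aic        = AIC(glm_full),
  deviance   = deviance(glm_full)
)
glm_slim <- stripGlmLR(glm_full)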
CREDIT: Nina Zumel for an awesome analysis of glm object memory allocation.