I want to detect interaction effects in a dataset. I have written code that creates all possible pairs of predictors, fits a glm model for each pair separately, and stores the model's key statistics as well as the model itself.
My dataset comprises 100,000+ observations, and I want to examine 200,000+ possible pair combinations.
The code runs without errors, but the problem is that after the 2,000th iteration the 60 GB of RAM on my PC is filled (when I begin running the code, there are 58 GB of free RAM).
For a reproducible example, I will use the mtcars dataset:
library(data.table)

data(mtcars)
setDT(mtcars)
predictor_names <- setdiff(names(mtcars), "am")
combinations <- combn(length(predictor_names), 2)
combinations <- t(combinations)
combinations <- as.data.frame(combinations)
models_glm <- list()
Coefficients_dt <- data.table(Predictor_1 = character(), Predictor_2 = character(),
                              dev_ratio = numeric(), Estimate = numeric(), p.value = numeric())
system.time(
  for (i in 1:(nrow(combinations) - 1)) {

    # Extracts the index positions of the names of the pairs
    #----------------------------------------------------------
    m <- combinations[i, 1]
    k <- combinations[i, 2]

    # Extracts the names of the predictors from a vector that holds them
    #------------------------------------------------------------------------
    m_name <- predictor_names[m]
    k_name <- predictor_names[k]

    # Uses the names of the predictors to construct a formula
    #------------------------------------------------------------------
    formula_glm <- paste0("am ~ ", m_name, " * ", k_name)
    formula_glm <- as.formula(formula_glm)

    # Passes the formula to a glm model
    #-------------------------------------------------------------------
    model <- glm(formula_glm, data = mtcars, family = "binomial")

    # Stores the model in a list
    #-------------------------------------------------------------------
    models_glm[[paste0(m_name, "_*_", k_name)]] <- model

    # Calculates the dev.ratio
    #---------------------------------------------------------------
    residual.deviance <- model$deviance
    null.deviance <- model$null.deviance
    dev.ratio <- (null.deviance - residual.deviance) / null.deviance

    # Extracts the coefficient estimate and p-value from the model
    #-------------------------------------------------------------------
    Coefficients_df <- as.data.frame(summary(model)$coefficients)
    names(Coefficients_df) <- c("Estimate", "SE", "Z", "p.value")

    if (nrow(Coefficients_df) == 4) {
      Coefficients_dt <- rbind(Coefficients_dt, data.table(
        Predictor_1 = m_name,
        Predictor_2 = k_name,
        dev_ratio = dev.ratio,
        Estimate = Coefficients_df$Estimate[4],
        p.value = Coefficients_df$p.value[4]
      ))
    }
  }
)
What can I do to overcome this problem?
First, I would like to understand the root cause of the problem: what is taking up space in RAM? The objects involved are not very large compared to the available RAM. Specifically, the Coefficients_dt data.table will at most grow to 200,000 rows x 5 columns.
So something else is going on that consumes more and more RAM as the iterations of the for-loop build up.
Next, I would like to know whether there is some action I can take in the middle of the execution of the for-loop -- e.g. a command nested in an if statement within the loop -- that would free RAM while saving to disk any objects that would be removed from RAM and should be safeguarded.
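For illustration only, I mean something along the lines of this hypothetical checkpoint (the interval of 500 iterations and the file name are placeholders, not something I have tested):

if (i %% 500 == 0) {
  # Hypothetical checkpoint: save the models accumulated so far to disk,
  # drop them from memory, and trigger garbage collection
  saveRDS(models_glm, paste0("models_glm_up_to_", i, ".rds"))
  models_glm <- list()
  gc()
}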
Your advice will be appreciated.
Consider the following options:
Pre-allocate any needed object so that, instead of expanding it with values iteratively (which forces the machine to reallocate memory each time), you assign values to existing elements:
models_glm <- vector(mode = "list", length = 45)   # 45 = choose(10, 2) predictor pairs in the mtcars example
In fact, even consider naming the elements beforehand:
pnames <- sapply(seq_len(nrow(combinations) - 1), function(i){
  paste0(predictor_names[combinations[i, 1]], "_*_",
         predictor_names[combinations[i, 2]])
})

models_glm <- setNames(vector(mode = "list", length = length(pnames)), pnames)
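With the list pre-allocated and pre-named, each iteration only assigns into a slot that already exists, so nothing has to grow. A minimal sketch (reusing the objects defined in the question) of what the storage step inside the loop could look like:

for (i in seq_len(nrow(combinations) - 1)) {
  m_name <- predictor_names[combinations[i, 1]]
  k_name <- predictor_names[combinations[i, 2]]
  formula_glm <- as.formula(paste0("am ~ ", m_name, " * ", k_name))
  model <- glm(formula_glm, data = mtcars, family = "binomial")

  # Assign into the existing, already-named slot by position instead of
  # growing the list with a new named element each iteration
  models_glm[[i]] <- model
}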
Use data.table::rbindlist() to row-bind a list of data.tables into one large data.table in a single call, instead of expanding a data frame row by row inside a loop. The snippet below uses lapply, which returns an object equal in length to its input. Also, notice the empty data.table returned in the else branch to avoid NULL returns; the empty tables are simply left out by rbindlist:
dTList <- lapply(seq_len(nrow(combinations) - 1), function(i) {
  # ... same as above: extract m_name and k_name, build the formula,
  #     fit the glm model, and compute dev.ratio ...

  # use the <<- operator to update the environment object outside the function
  models_glm[[paste0(m_name, "_*_", k_name)]] <<- model

  # ...
  Coefficients_df <- setNames(as.data.frame(summary(model)$coefficients),
                              c("Estimate", "SE", "Z", "p.value"))

  if (nrow(Coefficients_df) == 4) {
    data.table(
      Predictor_1 = m_name,
      Predictor_2 = k_name,
      dev_ratio = dev.ratio,
      Estimate = Coefficients_df$Estimate[4],
      p.value = Coefficients_df$p.value[4]
    )
  } else {
    # RETURN EMPTY DATA.TABLE
    data.table(
      Predictor_1 = character(),
      Predictor_2 = character(),
      dev_ratio = numeric(),
      Estimate = numeric(),
      p.value = numeric()
    )
  }
})
coefficients <- data.table::rbindlist(dTList)
rm(dTList)
gc()
Finally, for large operations that do not need design/programming work, consider running the script non-interactively with Rscript.exe rather than in RStudio or Rgui, as these latter programs require additional resources. Below is a command line that can be run from PowerShell, the CMD prompt, or a batch (.bat) file, assuming Rscript is on the PATH environment variable:
Rscript "C:\Path\To\ModelCoefficientDataBuild.R"
Specifically, RStudio's rsession.exe on Windows tends not to release memory back to the OS once it obtains it until the session is over (see the RStudio forum posts on the subject). Of course, be sure to save any needed objects to disk for later use:
saveRDS(coefficients, "coefficients_datatable.rds")
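The saved results can then be read back in a later interactive session without re-running any of the models:

coefficients <- readRDS("coefficients_datatable.rds")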