r, cross-validation, glmnet, parallel-foreach, doparallel

cv.glmnet parallel and memory issue


This is my first time using parallel processing, so the question is mainly about my poor syntax.

I would like some help capturing the output of a large number of cv.glmnet iterations, as I believe I have built cv_loop_run very inefficiently. Combined with the number of lambdas being 10,000, this produces a massive matrix that takes all of my memory and causes a crash. In essence, all I need is the minimum and the 1se lambda from each run (1,000 runs, not all 10,000 lambdas per run). So instead of capturing a 1k x 10k list in cv_loop_run, I would get a 1k-long list.

  registerDoParallel(cl=8,cores=4)
  cv_loop_run <- rbind(foreach(r = 1:1000,
                               .packages = "glmnet",
                               .combine = rbind,
                               .inorder = FALSE) %dopar% {

    cv_run <- cv.glmnet(X_predictors, Y_dependent, nfolds = fld,
                        nlambda = 10000,
                        alpha = 1,        # LASSO
                        grouped = FALSE,
                        parallel = TRUE)
  })
  l_min <- as.matrix(unlist(as.matrix(cv_loop_run[, 9, drop = FALSE])))   # column 9 is lambda.min

  l_1se <- as.matrix(unlist(as.matrix(cv_loop_run[, 10, drop = FALSE])))  # column 10 is lambda.1se

Solution

  • OK, so I found it myself. All I have to do is restrict the output of each cv.glmnet run, so that only the minimum and the 1se lambdas are kept from each run. This means that this:

    cv_run <- cv.glmnet(X_predictors, Y_dependent, nfolds = fld,
                        nlambda = 10000,
                        alpha = 1,        # LASSO
                        grouped = FALSE,
                        parallel = TRUE)
    

    becomes this:

    cv_run <- cv.glmnet(X_predictors, Y_dependent, nfolds = fld,
                        nlambda = 10000,
                        alpha = 1,        # LASSO
                        grouped = FALSE,
                        parallel = TRUE)[9:10]  # keep only lambda.min and lambda.1se
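For anyone adapting this, here is a minimal sketch of the whole loop with the same idea, except that the two values are picked by name rather than by position, since element positions in the cv.glmnet result may shift between glmnet versions. The toy data, the run count (n_runs), the smaller nlambda, and the two-core setup are assumptions chosen only to keep the example quick. parallel = TRUE is left out inside the loop on the assumption that the outer foreach already keeps the workers busy.

```r
library(glmnet)
library(doParallel)

# Hypothetical toy data standing in for X_predictors / Y_dependent
set.seed(1)
X_predictors <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
Y_dependent  <- X_predictors[, 1] - 2 * X_predictors[, 2] + rnorm(100)
fld <- 5

# Assumed setup: two workers; adjust to the machine at hand
registerDoParallel(cores = 2)

n_runs <- 10  # the question uses 1000; kept small here

cv_loop_run <- foreach(r = seq_len(n_runs),
                       .packages = "glmnet",
                       .combine  = rbind,
                       .inorder  = FALSE) %dopar% {
  cv_run <- cv.glmnet(X_predictors, Y_dependent,
                      nfolds  = fld,
                      nlambda = 100,   # the question uses 10000
                      alpha   = 1,     # LASSO
                      grouped = FALSE)
  # Return only the two scalars, selected by name, so each worker
  # sends back a length-2 vector instead of the full cv.glmnet object
  c(lambda.min = cv_run$lambda.min, lambda.1se = cv_run$lambda.1se)
}

# cv_loop_run is now an n_runs x 2 numeric matrix
l_min <- cv_loop_run[, "lambda.min"]
l_1se <- cv_loop_run[, "lambda.1se"]

stopImplicitCluster()
```

Because each iteration returns a plain named vector, rbind builds a small numeric matrix, and the unlist/as.matrix juggling from the question is no longer needed.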