Tags: r, parallel-processing, glmnet

Parallelize R script


To start, I have a rudimentary familiarity with the doParallel and parallel packages in R, so please refrain from suggesting these packages without example code.

I am currently working with LASSO regression models generated using the glmnet package. I am relying on the cv.glmnet function in this package to tell me what the ideal lambda is. All of this is irrelevant to my actual question, but I hope the backstory helps. The cv.glmnet function does what I want, but takes too long. I want to parallelize it.

My issue is that the parallel R packages are designed to take a list and then apply an operation to each element of that list, so when I try to pass a polished function like cv.glmnet (even though it is iterative internally), I get a single core processing the single dataset I want cv.glmnet to process, rather than the work being distributed across all the cores on my server.

Is it possible to distribute a single computation across multiple CPUs/cores in R (which packages, example code, etc.)? Or is it possible to make parallelizing packages, like parallel and doParallel, recognize the iterative structure of the cv.glmnet function and then distribute it for me? I'm fishing for recommendations; any help or insight would be greatly appreciated.

Unfortunately, I do not have permission to share the data I'm working with. For a reproducible example, please see this post; the code from the answer is copy/paste quality and generates data, fits LASSO regressions, and gives an example use of the cv.glmnet function: https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome


Solution

  • cv.glmnet is easily parallelizable: register a parallel backend and set the parallel argument to TRUE.

    An example of how to do this can be found in the documentation:

    https://www.rdocumentation.org/packages/glmnet/versions/2.0-16/topics/cv.glmnet

    This example uses doMC, but you should be able to easily change it to use the parallel package (through doParallel):

    library(doMC)    # fork-based backend; not available on Windows
    library(glmnet)
    registerDoMC(cores=4)                      # register 4 workers for foreach
    x = matrix(rnorm(1e5 * 100), 1e5, 100)
    y = rnorm(1e5)
    system.time(cv.glmnet(x,y))                # not parallel
    system.time(cv.glmnet(x,y,parallel=TRUE))  # this is parallel
    

    The version using the parallel package (through doParallel) would look like:

    library(doParallel)
    library(glmnet)
    no_cores <- detectCores() - 1
    print(no_cores)
    # Initiate cluster
    cl <- makeCluster(no_cores)
    registerDoParallel(cl)
    
    x = matrix(rnorm(1e5 * 100), 1e5, 100)
    y = rnorm(1e5)
    system.time(cv.glmnet(x,y))                # not parallel
    system.time(cv.glmnet(x,y,parallel=TRUE))  # this is parallel
    stopCluster(cl)
    

    To add to your question: there is a category of problems called "embarrassingly parallel" that can be trivially parallelized. Packages that support this, glmnet included, mostly run the independent pieces of work (here, the cross-validation folds) inside a foreach loop, so that the code inside the loop can be executed in parallel. So all that is needed in this case is to enable parallelization (register a parallel backend) and pass parallel=TRUE; the foreach loop will then execute in parallel.
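
To make the foreach idea concrete, here is a rough sketch of the pattern that parallel=TRUE relies on: one glmnet fit per cross-validation fold, each fit independent of the others. This is only an illustration, not cv.glmnet's actual source. The small simulated x and y, the two-worker cluster, and the names nfolds, foldid, held_out and fits are assumptions chosen for the example.

```r
library(glmnet)
library(doParallel)  # also attaches foreach and parallel

# Two workers is an arbitrary choice for this sketch
cl <- makeCluster(2)
registerDoParallel(cl)

# Small simulated data set so the example runs quickly
set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
y <- rnorm(200)

# Assign each row to one of nfolds folds
nfolds <- 5
foldid <- sample(rep(seq_len(nfolds), length.out = nrow(x)))

# Each iteration fits glmnet on all rows except fold k.
# The iterations do not depend on each other, so foreach can
# hand them to different workers.
fits <- foreach(k = seq_len(nfolds), .packages = "glmnet") %dopar% {
  held_out <- foldid == k
  glmnet(x[!held_out, , drop = FALSE], y[!held_out])
}

stopCluster(cl)
length(fits)  # one fitted model per fold
```

If no backend is registered, %dopar% falls back to running sequentially with a warning, which is why registering the backend is the step that actually turns on parallel execution.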