Tags: r, foreach, h2o, gbm, doparallel

run h2o algorithms inside a foreach loop?


I naively thought it would be straightforward to make multiple calls to h2o.gbm in parallel inside a foreach loop, but I got a strange error.

Error in { : 
         task 3 failed - "java.lang.AssertionError: Can't unlock: Not locked!"

Code below:

library(foreach)
library(doParallel)
library(doSNOW)

Xtr.hf = as.h2o(Xtr)
Xval.hf = as.h2o(Xval)

cl = makeCluster(6, type="SOCK")
registerDoSNOW(cl)
junk <- foreach(i=1:6, 
            .packages=c("h2o"), 
            .errorhandling = "stop",
            .verbose=TRUE) %dopar% 
{
   h2o.init(ip="localhost", nthreads=2, max_mem_size = "5G") 
   for ( j in 1:3 ) { 
     bm2 <- h2o.gbm(
       training_frame = Xtr.hf,
       validation_frame = Xval.hf,
       x = 2:ncol(Xtr.hf),
       y = 1,
       distribution = "gaussian",
       ntrees = 100,
       max_depth = 3,
       learn_rate = 0.1,
       nfolds = 1)
   }
  h2o.shutdown(prompt=FALSE)    
  return(iname)
}
stopCluster(cl)

Solution

  • NOTE: This is unlikely to be a good use of R's parallel foreach, but I'll answer your question first, then explain why. (BTW, when I use "cluster" in this answer I'm referring to an H2O cluster (even if it is just on your local machine), and not an R "cluster".)

    I've re-written your code, assuming the intention was to have a single H2O cluster, where all the models are to be made:

    library(foreach)
    library(doParallel)
    library(doSNOW)
    library(h2o)
    
    h2o.init(ip="localhost", nthreads=-1, max_mem_size = "5G") 
    
    Xtr.hf = as.h2o(Xtr)
    Xval.hf = as.h2o(Xval)
    
    cl = makeCluster(6, type="SOCK")
    registerDoSNOW(cl)
    junk <- foreach(i=1:6, 
                .packages=c("h2o"), 
                .errorhandling = "stop",
                .verbose=TRUE) %dopar% 
    {
       for ( j in 1:3 ) { 
         bm2 <- h2o.gbm(
           training_frame = Xtr.hf,
           validation_frame = Xval.hf,
           x = 2:ncol(Xtr.hf),
           y = 1,
           distribution = "gaussian",
           ntrees = 100,
           max_depth = 3,
           learn_rate = 0.1,
           nfolds = 1)

         #TODO: do something with bm2 here?
       }
      return(iname)  #???
    }
    stopCluster(cl)
    

    I.e. in outline form:

    • Start H2O, and load Xtr and Xval into it
    • Start 6 threads in your R client
    • In each thread make 3 GBM models (one after each other)

    I dropped the h2o.shutdown() command, guessing that you didn't intend that (when you shut down the H2O cluster, the models you just made get deleted). I've highlighted where you might want to be doing something with your model, and I've given H2O all the threads on your machine (that is the nthreads=-1 in h2o.init()), not just 2.
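    For example (this is just a sketch of one possibility, not something from your original code), inside the foreach body you could record the validation RMSE of each model and return those numbers instead of the undefined iname:

       rmses <- numeric(3)
       for ( j in 1:3 ) {
         bm2 <- h2o.gbm(training_frame = Xtr.hf, validation_frame = Xval.hf,
                        x = 2:ncol(Xtr.hf), y = 1, distribution = "gaussian",
                        ntrees = 100, max_depth = 3, learn_rate = 0.1)
         rmses[j] <- h2o.rmse(bm2, valid = TRUE)   # keep the validation metric
         # or: h2o.saveModel(bm2, path = "models/") to keep the model itself
       }
       return(rmses)   # foreach then collects one numeric vector per i
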

    You can make H2O models in parallel, but it is generally a bad idea, as they end up fighting for resources. Better to do them one at a time, and rely on H2O's own parallel code to spread the computation over the cluster. (When the cluster is a single machine this tends to be very efficient.)
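    To make that concrete, here is what the serial version could look like (again only a sketch, re-using the same Xtr/Xval data and GBM settings as above): one H2O cluster, all cores, and a plain loop in R:

    library(h2o)
    h2o.init(ip = "localhost", nthreads = -1, max_mem_size = "5G")

    Xtr.hf  <- as.h2o(Xtr)
    Xval.hf <- as.h2o(Xval)

    # 18 models (6 x 3, as in your loops), one after another; each h2o.gbm
    # call is parallelised internally by H2O across all available cores.
    models <- lapply(1:18, function(k) {
      h2o.gbm(training_frame = Xtr.hf, validation_frame = Xval.hf,
              x = 2:ncol(Xtr.hf), y = 1, distribution = "gaussian",
              ntrees = 100, max_depth = 3, learn_rate = 0.1)
    })
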

    The fact that you've gone to the trouble of making a parallel loop in R makes me think you've missed the way H2O works: it is a server written in Java, and R is just a lightweight client that sends it API calls. The GBM calculations are not done in R; they are all done in Java code.

    The other way to interpret your code is to run multiple instances of H2O, i.e. multiple H2O clusters. This might be a good idea if you have a set of machines, and you know the H2O algorithm is not scaling very well across a multi-node cluster. Doing it on a single machine is almost certainly a bad idea. But, for the sake of argument, this is how you do it (untested):

    library(foreach)
    library(doParallel)
    library(doSNOW)
    
    cl = makeCluster(6, type="SOCK")
    registerDoSNOW(cl)
    junk <- foreach(i=1:6, 
                .packages=c("h2o"), 
                .errorhandling = "stop",
                .verbose=TRUE) %dopar% 
    {
       library(h2o)
       h2o.init(ip="localhost", port = 54321 + (i*2), nthreads=2, max_mem_size = "5G") 
    
        Xtr.hf = as.h2o(Xtr)
        Xval.hf = as.h2o(Xval)
    
       for ( j in 1:3 ) { 
         bm2 <- h2o.gbm(
           training_frame = Xtr.hf,
           validation_frame = Xval.hf,
           x = 2:ncol(Xtr.hf),
           y = 1,
           distribution = "gaussian",
           ntrees = 100,
           max_depth = 3,
           learn_rate = 0.1,
           nfolds = 1)

         #TODO: save bm2 here
       }
      h2o.shutdown(prompt=FALSE)    
      return(iname)  #???
    }
    stopCluster(cl)
    

    Now the outline is:

    • Create 6 R threads
    • In each thread, start an H2O cluster that is running on localhost but on a port unique to that cluster. (The i*2 is because each H2O cluster is actually using two ports.)
    • Upload your data to the H2O cluster (i.e. this will be repeated 6 times, once for each cluster).
    • Make 3 GBM models, one after each other.
    • Do something with those models (e.g. save them to disk; see the sketch after this list)
    • Kill the cluster for the current thread.
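
    For the "save bm2 here" step, one option (my own sketch, with a hypothetical models/ directory, not part of your code) is to write each model to disk before the worker shuts its cluster down, so the models survive the h2o.shutdown():

       # inside the %dopar% body, after each h2o.gbm() call:
       path <- h2o.saveModel(bm2, path = sprintf("models/worker_%d", i), force = TRUE)

       # later, back in the main R session (with an H2O cluster running),
       # the saved models can be reloaded:
       m <- h2o.loadModel(path)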

    If you have 12+ threads on your machine, and 30+ GB memory, and the data is relatively small, this will be roughly as efficient as using one H2O cluster and making 12 GBM models in serial. If not, I believe it will be worse. (But, if you have pre-started 6 H2O clusters on 6 remote machines, this might be a useful approach - I must admit I'd been wondering how to do this, and using the parallel library for it had never occurred to me until I saw your question!)

    NOTE: as of the current version (3.10.0.6), I know the above code won't work, as there is a bug in h2o.init() that effectively means it is ignoring the port. (Workarounds: either pre-start all 6 H2O clusters on the command line, or set the port in an environment variable.)
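
    For the first workaround, a rough sketch (untested on that H2O version; the paths and ports here are only illustrative) is to launch each cluster yourself from a shell and then have h2o.init() attach to it rather than start one, via startH2O = FALSE:

    # Pre-start one cluster per worker from a shell, each on its own port, e.g.:
    #   java -Xmx5g -jar h2o.jar -port 54323 &
    #   java -Xmx5g -jar h2o.jar -port 54325 &
    #   ...

    # Then, inside each foreach worker, attach to the matching cluster
    # instead of trying to launch a new one:
    h2o.init(ip = "localhost", port = 54321 + (i * 2), startH2O = FALSE)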