H2O "grid: Cannot append new models to a grid with different training input" error when parallelizing the execution of autoML in a foreach loop


I'm trying to parallelize the training of multiple ML models using the autoML feature provided by H2O. The core code I'm using is the following:

library(foreach)
library(doParallel)

project_folder <- "/home/user/Documents/"

ncores <- parallel::detectCores(logical = FALSE)
nlogiccpu <- parallel::detectCores()
max_mem_size <- "4G"

cl<-makeCluster(nlogiccpu)

registerDoParallel(cl)

df4 <-foreach(i = as.numeric(seq(1,length(divisions))), .combine=rbind) %dopar% {
  library(dplyr)
  library(h2o)
  h2o.init(nthreads = ncores, max_mem_size = max_mem_size)

  div <- divisions[i]

  df.h2o <- as.h2o(
    df %>% filter(code == div) )

  y <- "TARGET"
  x <- names(df.train.x.discretized)

  automl.models.h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame = df.h2o,
    nfolds = 10,
    seed = 111,
    project_name = paste0("PRJ_", div)
  )

  leader <- automl.models.h2o@leader

  div_folder <- file.path(project_folder, paste0("Division_", div))
  h2o.saveModel(leader,
                path = file.path(div_folder, "TARGET_model_bin"))
  ...
}

Only some of the models are trained and saved to their folders, because at some point I get the following error:

water.exceptions.H2OIllegalArgumentException: Illegal argument: training_frame of function: grid: Cannot append new models to a grid with different training input

I suppose grids are used during the autoML phase, so I tried to find a parameter for passing a grid_id, as I can do with the h2o.grid function as follows:

grid <- h2o.grid("gbm",  grid_id = paste0("gbm_grid_id", div),
                 ...)

but I can't find a way to do that. I'm using version 3.24.0.2 of the H2O package.

Any suggestions?


Solution

  • The short answer is that a single grid cannot contain models trained on different training frames. Each grid of models must be associated with a single training set (the idea is that you do not want to compare models trained on different training sets). This is why you are hitting the error: each of your df.h2o training frames is a different subset of the original df frame.

    Another note: H2O and R's parallel functionality don't mix well. By default, h2o.init() connects to an H2O cluster already running at localhost:54321 if there is one, so your foreach workers all end up talking to the same cluster. H2O model training is already parallelized, but in a different way (for scalability reasons). The training of a single model is parallelized within H2O (on multiple cores), but H2O is not designed to train multiple models at once. If you want to train multiple models at once on a single machine, then you would have to start multiple H2O clusters in different R sessions on different ports.
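
    As a rough illustration of the "different ports" idea, here is a minimal sketch, not a tested solution. The helper names h2o_port_for_task and train_division, the base port, and the nthreads default are assumptions made for this example; the AutoML arguments are copied from the question. It assumes each foreach task starts its own H2O cluster on its own port and shuts it down when done. Ports are spaced by two because H2O also uses the port directly above the one it is given.

```r
# Hypothetical helper: map a task index to a dedicated H2O port.
# H2O uses the given port and the next one, so ports are spaced by 2.
h2o_port_for_task <- function(i, base_port = 54321L) {
  as.integer(base_port + 2L * (as.integer(i) - 1L))
}

# Hypothetical helper: run AutoML for one division on its own H2O
# cluster, save the leader model, then shut that cluster down.
train_division <- function(i, div, df, x, y, out_dir,
                           nthreads = 2, max_mem_size = "4G") {
  # Start (or connect to) a cluster on this task's own port.
  h2o::h2o.init(port = h2o_port_for_task(i),
                nthreads = nthreads, max_mem_size = max_mem_size)
  # Shut this task's cluster down even if training fails.
  on.exit(h2o::h2o.shutdown(prompt = FALSE), add = TRUE)

  # Subset in base R, then upload only this division's rows.
  train <- h2o::as.h2o(df[df$code == div, , drop = FALSE])

  aml <- h2o::h2o.automl(x = x, y = y, training_frame = train,
                         nfolds = 10, seed = 111,
                         project_name = paste0("PRJ_", div))

  # Returns the path of the saved leader model.
  h2o::h2o.saveModel(aml@leader, path = out_dir)
}
```

    Inside the foreach loop from the question, the body would then reduce to a call such as train_division(i, divisions[i], df, x, y, out_dir), with out_dir built as in the original code. Keep in mind that every concurrently running cluster can claim up to max_mem_size of RAM and nthreads cores, so the number of workers should be sized accordingly. If that is too heavy, a plain sequential loop against a single cluster is the simpler option, since H2O already uses all cores for each model.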