Tags: r, h2o, grid-search

Is it possible to use loaded h2o grids for stacked ensembles?


I am currently working on multiple datasets using different machine learning methods with R and the h2o library. Since I run several 10-fold cross-validations for each dataset, I ran a random grid search for each and saved the grids using h2o.saveGrid. When I load those grids again to build ensembles using h2o.stackedEnsemble, I get the error message:

Error: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame id. Looks like keep_cross_validation_predictions wasn't set when building the models.

However, keep_cross_validation_predictions is set to TRUE, and everything runs fine if I use the grid without saving and loading it. So I guess that something gets lost somewhere between saving and loading.

Does anyone know whether there is a way to use loaded grids for stacked ensembles in h2o, or is it simply not supported yet? I would appreciate any insight, since this would save me a lot of time: I cannot keep all the grids in my h2o cluster all the time.

I am using R 3.6.3 and h2o 3.32.0.1

A minimal working example that reproduces the error for me:

library(h2o)
h2o.init()

train_data <- data.frame(y = rnorm(100,1,2),
                         x1 = rnorm(100,5,5),
                         x2 = rnorm(100,4,4),
                         x3 = rnorm(100,3,3),
                         x4 = rnorm(100,2,2))

params <- list(max_depth = seq(1, 6, 1),
               sample_rate = seq(0.2, 1.0, 0.1))
search_criteria <- list(strategy = "RandomDiscrete", max_models = 10, seed = 2102)

train_h2o <- as.h2o(train_data,destination_frame = "Train")

gbm_grid <- h2o.grid("gbm",y = "y", x = c("x1","x2","x3","x4"), training_frame = train_h2o,
                     grid_id = "gbm_1",  nfolds = 10, ntrees = 50, seed= 1111,
                     keep_cross_validation_predictions = TRUE,
                     hyper_params = params, 
                     search_criteria = search_criteria)

# Building the ensemble directly from the in-memory grid works
test_ens <- h2o.stackedEnsemble(y = "y", x = c("x1","x2","x3","x4"), training_frame = train_h2o,
                                metalearner_algorithm = "glm", model_id = "Ens1",
                                base_models = gbm_grid@model_ids[1:10])
h2o.performance(test_ens)

h2o.saveGrid(grid_directory = paste0(getwd(),"/Data"),grid_id = "gbm_1")

After loading the grid, training the ensemble produces the error:

h2o.removeAll()

train_h2o <- as.h2o(train_data,destination_frame = "Train")
gbm_grid <- h2o.loadGrid(paste0(getwd(),"/Data/gbm_1"))

test_ens <- h2o.stackedEnsemble(y = "y", x = c("x1","x2","x3","x4"), training_frame = train_h2o,
                                metalearner_algorithm = "glm", model_id = "Ens2",
                                base_models = gbm_grid@model_ids[1:10])

I have also tried setting export_checkpoints_dir in h2o.grid and manually loading all the models (including their auto-generated CV fold models, which, unlike with h2o.saveGrid, are also saved this way), but it does not change anything.

Cheers


Solution

  • Stacked Ensemble requires the CV predictions from the base learners, which are not currently being saved when you save the models to disk via h2o.saveModel(). They need to be saved and then reloaded into the model when using h2o.loadModel(), so that they're available for the metalearning step of the Stacked Ensemble algorithm.

    Update: This feature has been added and is available as of H2O 3.32.0.3. The latest stable version of H2O can be downloaded from the H2O website.
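
As a rough sketch of how the save/load cycle from the question might look after upgrading: the example below assumes H2O 3.32.0.3 or later and assumes that h2o.saveGrid accepts an export_cross_validation_predictions argument, as I recall from the R documentation of those releases (check ?h2o.saveGrid on your installed version). The toy data, the smaller grid, and the temporary directory are illustrative choices, not part of the original answer, and the code has not been verified against every H2O release.

```r
library(h2o)
h2o.init()

# Toy data mirroring the question's example (values are arbitrary)
set.seed(1)
train_data <- data.frame(y  = rnorm(100, 1, 2),
                         x1 = rnorm(100, 5, 5),
                         x2 = rnorm(100, 4, 4),
                         x3 = rnorm(100, 3, 3),
                         x4 = rnorm(100, 2, 2))
predictors <- c("x1", "x2", "x3", "x4")
train_h2o <- as.h2o(train_data, destination_frame = "Train")

# Small Cartesian grid; CV predictions must be kept for stacking
gbm_grid <- h2o.grid("gbm", y = "y", x = predictors,
                     training_frame = train_h2o, grid_id = "gbm_1",
                     nfolds = 5, ntrees = 20, seed = 1111,
                     keep_cross_validation_predictions = TRUE,
                     hyper_params = list(max_depth = c(2, 4)))

# Assumption: this argument exists in H2O >= 3.32.0.3 and writes the
# CV holdout predictions to disk alongside the models
grid_dir <- file.path(tempdir(), "Data")
h2o.saveGrid(grid_directory = grid_dir, grid_id = "gbm_1",
             export_cross_validation_predictions = TRUE)

# Simulate a fresh cluster: drop everything, then reload frame and grid
h2o.removeAll()
train_h2o <- as.h2o(train_data, destination_frame = "Train")
gbm_grid <- h2o.loadGrid(file.path(grid_dir, "gbm_1"))

# With the CV predictions restored, the metalearner can be trained
ens <- h2o.stackedEnsemble(y = "y", x = predictors,
                           training_frame = train_h2o,
                           metalearner_algorithm = "glm",
                           model_id = "Ens2",
                           base_models = gbm_grid@model_ids)
h2o.performance(ens)
```

If the error from the question still appears, the likeliest cause is that the grid was saved without the export flag (or with an older H2O version), in which case the grid would need to be retrained and saved again.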