I am currently working on multiple datasets using different machine learning methods with R and the h2o library. Since I have several 10-fold cross-validations for each dataset, I ran a random grid search for each and saved the grids using h2o.saveGrid. When I loaded those grids again to build ensembles using h2o.stackedEnsemble, it returned the error message:
Error: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame id. Looks like keep_cross_validation_predictions wasn't set when building the models.
However, keep_cross_validation_predictions is set to TRUE, and everything runs perfectly fine if I use the grid without saving and loading it. So I guess that something gets lost somewhere between saving and loading.
Does anyone know whether there is a way to use loaded grids for stacked ensembles in h2o, or is it simply not supported yet? I would appreciate any insight, since this would save me a lot of time: I cannot keep all the grids in my h2o cluster all the time.
I am using R 3.6.3 and h2o 3.32.0.1.
A minimal working example that reproduces the error for me:
library(h2o)
h2o.init()
train_data <- data.frame(y = rnorm(100, 1, 2),
                         x1 = rnorm(100, 5, 5),
                         x2 = rnorm(100, 4, 4),
                         x3 = rnorm(100, 3, 3),
                         x4 = rnorm(100, 2, 2))
params <- list(max_depth = seq(1, 6, 1),
               sample_rate = seq(0.2, 1.0, 0.1))
search_criteria <- list(strategy = "RandomDiscrete", max_models = 10, seed = 2102)
train_h2o <- as.h2o(train_data, destination_frame = "Train")
gbm_grid <- h2o.grid("gbm", y = "y", x = c("x1", "x2", "x3", "x4"), training_frame = train_h2o,
                     grid_id = "gbm_1", nfolds = 10, ntrees = 50, seed = 1111,
                     keep_cross_validation_predictions = TRUE,
                     hyper_params = params,
                     search_criteria = search_criteria)
test_ens <- h2o.stackedEnsemble(y = "y", x = c("x1", "x2", "x3", "x4"), training_frame = train_h2o,
                                metalearner_algorithm = "glm", model_id = "Ens1",
                                base_models = gbm_grid@model_ids[1:10])
# Evaluate the ensemble only after it has been built
h2o.performance(test_ens)
h2o.saveGrid(grid_directory = paste0(getwd(),"/Data"),grid_id = "gbm_1")
After loading the grid again, training the ensemble produces the error:
h2o.removeAll()
train_h2o <- as.h2o(train_data,destination_frame = "Train")
gbm_grid <- h2o.loadGrid(paste0(getwd(),"/Data/gbm_1"))
test_ens <- h2o.stackedEnsemble(y = "y", x = c("x1","x2","x3","x4"), training_frame = train_h2o,
metalearner_algorithm = "glm", model_id = "Ens2",
base_models = gbm_grid@model_ids[1:10])
I have also tried setting export_checkpoints_dir in h2o.grid and manually loading all the models (including their auto-generated CV fold models, which, unlike with h2o.saveGrid, are also saved this way), but it does not change anything.
Cheers
Stacked Ensemble requires the CV predictions from the base learners, which are currently not saved when you save the models to disk via h2o.saveModel(). They need to be saved and then reloaded into the model when using h2o.loadModel(), so that they're available for the metalearning step of the Stacked Ensemble algorithm.
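One way to confirm this diagnosis is to check whether each reloaded model's CV holdout prediction frame is actually present in the cluster. The sketch below is only illustrative: `has_cv_predictions` is a hypothetical helper (not part of h2o), it assumes the model object stores the frame reference in `model@model$cross_validation_holdout_predictions_frame_id`, and it assumes the grid from the question was saved under `Data/gbm_1`.

```r
library(h2o)
h2o.init()

# Hypothetical helper: TRUE if the CV holdout prediction frame that the
# model refers to is currently stored in the H2O cluster.
has_cv_predictions <- function(model_id) {
  model <- h2o.getModel(model_id)
  # Assumption: the frame reference lives in this slot of the model object
  frame_ref <- model@model$cross_validation_holdout_predictions_frame_id
  if (is.null(frame_ref)) return(FALSE)
  frame_ref$name %in% as.character(h2o.ls()$key)
}

# Path taken from the question; the check is skipped if no grid was saved there
grid_path <- file.path(getwd(), "Data", "gbm_1")
if (file.exists(grid_path)) {
  gbm_grid <- h2o.loadGrid(grid_path)
  print(sapply(unlist(gbm_grid@model_ids), has_cv_predictions))
}
```

If the entries come back FALSE for a reloaded grid, the base models are there but their holdout predictions are not, which would match the error message from h2o.stackedEnsemble.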
Update: This feature has been added and is available in H2O 3.32.0.3, which can be downloaded as the latest stable version of H2O.
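For illustration, here is a minimal sketch of the save/load round trip on a version that has the feature. It assumes that h2o.saveGrid accepts an `export_cross_validation_predictions` argument, as documented in recent releases; please check `?h2o.saveGrid` in your installed version. The grid is kept deliberately small (5 folds, 4 models), and `grid_dir` is just a name chosen for this example.

```r
library(h2o)
h2o.init()

set.seed(1)
train_data <- data.frame(y  = rnorm(100, 1, 2),
                         x1 = rnorm(100, 5, 5),
                         x2 = rnorm(100, 4, 4),
                         x3 = rnorm(100, 3, 3),
                         x4 = rnorm(100, 2, 2))
predictors <- c("x1", "x2", "x3", "x4")
train_h2o <- as.h2o(train_data, destination_frame = "Train")

# Small random grid; the CV predictions must be kept for stacking
gbm_grid <- h2o.grid("gbm", y = "y", x = predictors, training_frame = train_h2o,
                     grid_id = "gbm_1", nfolds = 5, ntrees = 20, seed = 1111,
                     keep_cross_validation_predictions = TRUE,
                     hyper_params = list(max_depth = 1:6,
                                         sample_rate = seq(0.2, 1.0, 0.1)),
                     search_criteria = list(strategy = "RandomDiscrete",
                                            max_models = 4, seed = 2102))

# Assumption: this argument also writes the CV holdout predictions to disk
grid_dir <- file.path(getwd(), "Data")
h2o.saveGrid(grid_directory = grid_dir, grid_id = "gbm_1",
             export_cross_validation_predictions = TRUE)

# Simulate a fresh cluster, then reload the data and the grid
h2o.removeAll()
train_h2o <- as.h2o(train_data, destination_frame = "Train")
gbm_grid <- h2o.loadGrid(file.path(grid_dir, "gbm_1"))

# The ensemble is now built from the reloaded base models
test_ens <- h2o.stackedEnsemble(y = "y", x = predictors, training_frame = train_h2o,
                                metalearner_algorithm = "glm", model_id = "Ens2",
                                base_models = gbm_grid@model_ids)
h2o.performance(test_ens)
```

If you save models individually instead of whole grids, h2o.saveModel should take the same argument in those versions, but that is worth verifying against the documentation as well.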