
Use of validation_frame in H2O AutoML


Just started with H2O AutoML so apologies in advance if I have missed something basic.

I have a binary classification problem where the data are observations from K years. I want to train on the first K-1 years, then tune the models and select the best one based explicitly on the remaining year K.

If I switch off cross-validation (with nfolds=0) to avoid randomly mixing years across the N folds, and define the year-K data as the validation_frame, then no ensemble is created (as expected according to the documentation), and the ensemble is something I need.

If I train with cross-validation (default nfolds) and define the year-K data as the validation frame:

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)

then according to http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html the validation_frame is ignored "...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."

Is there a way to have the models tuned and the best one (ensemble or not) selected based on the year-K data only, while still getting the ensemble of models in the output?

Thanks a lot!


Solution

  • You don't want cross-validation (CV) when dealing with time-series (non-IID) data, since you don't want folds from the future to predict the past.

    I would explicitly add nfolds=0 so that CV is disabled in AutoML:

    aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
    aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)
    

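    To still get an ensemble, pass a blending_frame to train(). The Stacked Ensembles are then trained by blending (holdout stacking) on that frame instead of on cross-validation predictions, which also works for time-series data. See the blending_frame entry in the AutoML documentation for more info.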

    Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags) so that your model gets information from previous years and their aggregates (e.g. a weighted moving average).
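
Putting the pieces above together, here is a minimal sketch of what the call could look like. It is not a definitive recipe. The function name `train_with_blending` and the way the data is split are assumptions for illustration: `train` would hold the earliest years, `blend` a later holdout (for example year K-1) used only to fit the ensemble metalearner, and `k_year` the final year. Passing `k_year` as both `validation_frame` and `leaderboard_frame` is one way to get early stopping and the leaderboard ranking driven by year K, as the question asks.

```python
def train_with_blending(train, blend, k_year, x, y, max_runtime_secs=3600):
    """Run AutoML without CV; stack via a blending frame; rank on year K.

    train, blend and k_year are H2OFrames split by time (assumed layout:
    earliest years, a later holdout such as year K-1, and year K).
    """
    # Imported lazily so this module loads even where h2o is not installed.
    from h2o.automl import H2OAutoML

    # nfolds=0 disables cross-validation, so no random mixing of years.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1, nfolds=0)
    aml.train(
        x=x,
        y=y,
        training_frame=train,
        validation_frame=k_year,   # used for early stopping when nfolds=0
        leaderboard_frame=k_year,  # leaderboard metrics computed on year K
        blending_frame=blend,      # holdout used to train the ensembles
    )
    return aml
```

This assumes an H2O cluster is already running (`h2o.init()`). After training, `aml.leaderboard` shows the models ranked on the year-K frame and `aml.leader` is the top-ranked one, which may or may not be a Stacked Ensemble.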