python-3.x · h2o · ensemble-learning

Ensemble model in H2O with fold_column argument


I am new to H2O in Python. I am trying to model my data with a stacked ensemble, following the example code on H2O's website: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html

I used GBM and RF as base models and then combined them into a stacked ensemble. In addition, I created an extra column named 'fold' in my training data to use as fold_column = "fold".

I ran 10-fold CV and got results from the first fold, but the predictions from the other 9 folds are all empty. What am I missing here?

Here is my sample data:

(screenshot of the sample data)

code:

from __future__ import print_function

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init(port=23, nthreads=6)

train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)

x = train.drop(['Date','EQUITY','fold'],axis=1).columns
y = 'EQUITY'

cat_cols = ['A','B','C','D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()

my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
                                      ntrees=10,
                                      max_depth=3,
                                      min_rows=2,
                                      learn_rate=0.2,
                                      keep_cross_validation_predictions=True,
                                      seed=1)

my_gbm.train(x=x, y=y, training_frame=train, fold_column = "fold")

Then, when I check the CV results with:

my_gbm.cross_validation_predictions()

(screenshot of the per-fold prediction output)

Also, when I try the ensemble on the test set, I get the warning below:

# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)

# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)

pred = ensemble.predict(test)
pred

/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
  warnings.warn(w)

Am I missing something about fold_column?


Solution

  • Here is an example of how to use a custom fold column (created from a list). It is a modified version of the Python example on the Stacked Ensembles page of the H2O User Guide.

    from __future__ import print_function

    import h2o
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
    from h2o.grid.grid_search import H2OGridSearch
    h2o.init()
    
    # Import a sample binary outcome training set into H2O
    train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
    
    # Identify predictors and response
    x = train.columns
    y = "response"
    x.remove(y)
    
    # For binary classification, response should be a factor
    train[y] = train[y].asfactor()
    
    # Add a fold column, generated from a list
    # The list has 10 unique values, so there will be 10 folds
    fold_list = list(range(10)) * 1000  # 10,000 entries: one fold ID per row
    train['fold_id'] = h2o.H2OFrame(fold_list)
    
    
    # Train and cross-validate a GBM
    my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                          ntrees=10,
                                          keep_cross_validation_predictions=True,
                                          seed=1)
    my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")
    
    # Train and cross-validate a RF
    my_rf = H2ORandomForestEstimator(ntrees=50,
                                     keep_cross_validation_predictions=True,
                                     seed=1)
    my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")
    
    # Train a stacked ensemble using the GBM and RF above
    ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
    ensemble.train(x=x, y=y, training_frame=train)
    
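As a quick sanity check on the fold construction (plain Python, no H2O cluster needed): the list above cycles through the IDs 0-9, so each of the 10,000 rows gets a fold ID and every fold ends up with exactly 1,000 rows:

```python
# Pure-Python illustration of the fold assignment used above (no H2O needed).
# The list cycles 0..9, so every fold gets exactly 1000 of the 10,000 rows.
from collections import Counter

fold_list = list(range(10)) * 1000   # same construction as in the H2O example

counts = Counter(fold_list)
print(len(fold_list))        # 10000 rows
print(sorted(counts))        # fold IDs 0..9
print(set(counts.values()))  # each fold holds 1000 rows
```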

    To answer your second question about how to view the cross-validated predictions of a model: they are stored in two places, but the method you probably want is .cross_validation_holdout_predictions(). It returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations:

    In [11]: my_gbm.cross_validation_holdout_predictions()
    Out[11]:
      predict        p0        p1
    ---------  --------  --------
            1  0.323155  0.676845
            1  0.248131  0.751869
            1  0.288241  0.711759
            1  0.407768  0.592232
            1  0.507294  0.492706
            0  0.6417    0.3583
            1  0.253329  0.746671
            1  0.289916  0.710084
            1  0.524328  0.475672
            1  0.252006  0.747994
    
    [10000 rows x 3 columns]
    

    The second method, .cross_validation_predictions(), returns a list of H2OFrames, one per fold. Each frame has the same number of rows as the original training frame, but the rows that are not held out in that fold are filled with zeros. This is usually not the most useful format, so I'd recommend using the other method instead.

    In [13]: type(my_gbm.cross_validation_predictions())
    Out[13]: list
    
    In [14]: len(my_gbm.cross_validation_predictions())
    Out[14]: 10
    
    In [15]: my_gbm.cross_validation_predictions()[0]
    Out[15]:
      predict        p0        p1
    ---------  --------  --------
            1  0.323155  0.676845
            0  0         0
            0  0         0
            0  0         0
            0  0         0
            0  0         0
            0  0         0
            0  0         0
            0  0         0
            0  0         0
    
    [10000 rows x 3 columns]
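Since each training row is held out in exactly one fold, the zero-padded per-fold frames are complementary: summing them element-wise reconstructs the single frame returned by .cross_validation_holdout_predictions(). A minimal pure-Python sketch of that relationship, using plain lists and made-up probabilities in place of H2OFrames:

```python
# Sketch: each per-fold vector is zero everywhere except the rows held out
# in that fold, so summing the fold vectors rebuilds the full holdout vector.
# Toy setup: 6 rows, 3 folds, hypothetical p1 probabilities.
rows = [0.7, 0.2, 0.9, 0.4, 0.6, 0.1]   # "true" holdout p1 per row
fold_of_row = [0, 1, 2, 0, 1, 2]        # fold assignment per row

# Build the zero-padded per-fold vectors, mimicking cross_validation_predictions()
per_fold = [
    [p if fold_of_row[i] == k else 0.0 for i, p in enumerate(rows)]
    for k in range(3)
]

# Element-wise sum over folds recovers the holdout predictions
recovered = [sum(col) for col in zip(*per_fold)]
print(recovered == rows)  # True
```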