
Retrieve cross validation performance (AUC) on h2o AutoML for holdout dataset


I am training a binary classification model with h2o AutoML using the default cross-validation (nfolds=5). I need to obtain the AUC score for each holdout fold in order to compute the variability.

This is the code I am using:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# convert columns to factors
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()

# set the predictor and response columns
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response_col = "CAPSULE"

# split into train and testing sets
train, test = prostate.split_frame(ratios = [0.8], seed = 1234)


aml = H2OAutoML(seed=1, max_runtime_secs=100, exclude_algos=["DeepLearning", "GLM"],
                    nfolds=5, keep_cross_validation_predictions=True)

aml.train(predictors, response_col, training_frame=prostate)

leader = aml.leader

I checked that the leader is not a StackedEnsemble model (for which the validation metrics are not available). Still, I am not able to retrieve the five AUC scores.

Any idea on how to do so?


Solution

  • Here's how it's done:

    import h2o
    from h2o.automl import H2OAutoML
    
    h2o.init()
    
    # import prostate dataset
    prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
    # convert columns to factors
    prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
    prostate['RACE'] = prostate['RACE'].asfactor()
    prostate['DCAPS'] = prostate['DCAPS'].asfactor()
    prostate['DPROS'] = prostate['DPROS'].asfactor()
    
    # set the predictor and response columns
    predictors = ["AGE", "RACE", "VOL", "GLEASON"]
    response_col = "CAPSULE"
    
    # split into train and testing sets
    train, test = prostate.split_frame(ratios = [0.8], seed = 1234)
    
    # run AutoML for 100 seconds
    aml = H2OAutoML(seed=1, max_runtime_secs=100, exclude_algos=["DeepLearning", "GLM"],
                        nfolds=5, keep_cross_validation_predictions=True)
    aml.train(x=predictors, y=response_col, training_frame=prostate)
    
    # Get the leader model
    leader = aml.leader
    

    There is a caveat to mention here about cross-validated AUC -- H2O currently stores two computations of CV AUC. One is an aggregated version (take the AUC of aggregated CV predictions), and the other is the "true" definition of cross-validated AUC (an average of the k AUCs from k-fold cross-validation). The latter is stored in an object which also contains the individual fold AUCs, as well as the standard deviation across the folds.

    If you're wondering why we do this, there are some historical and technical reasons why we have two versions, as well as an open ticket to only ever report the latter.

    The first one is what you get when you do this (and also what appears on the AutoML Leaderboard).

    # print CV AUC for leader model
    print(leader.model_performance(xval=True).auc())
    
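    The model's metric accessors should give the same aggregated number; for example, leader.auc(xval=True) is a shortcut for the call above:

    # the same aggregated CV AUC via the model's metric accessor
    print(leader.auc(xval=True))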

    If you want the fold-wise AUCs so you can compute or view their mean and variability (standard deviation), you can do that by looking here:

    # print CV metrics summary
    leader.cross_validation_metrics_summary()
    

    Output:

    Cross-Validation Metrics Summary:
                 mean        sd           cv_1_valid    cv_2_valid    cv_3_valid    cv_4_valid    cv_5_valid
    -----------  ----------  -----------  ------------  ------------  ------------  ------------  ------------
    accuracy     0.71842104  0.06419111   0.7631579     0.6447368     0.7368421     0.7894737     0.65789473
    auc          0.7767409   0.053587236  0.8206676     0.70905924    0.7982079     0.82538515    0.7303846
    aucpr        0.6907578   0.0834025    0.78737605    0.7141305     0.7147677     0.67790955    0.55960524
    err          0.28157896  0.06419111   0.23684211    0.35526314    0.2631579     0.21052632    0.34210527
    err_count    21.4        4.8785243    18.0          27.0          20.0          16.0          26.0
    ---          ---         ---          ---           ---           ---           ---           ---
    precision    0.61751753  0.08747421   0.675         0.5714286     0.61702126    0.7241379     0.5
    r2           0.20118153  0.10781976   0.3014902     0.09386432    0.25050205    0.28393403    0.07611712
    recall       0.84506994  0.08513061   0.84375       0.9142857     0.9354839     0.7241379     0.8076923
    rmse         0.435928    0.028099842  0.41264254    0.47447023    0.42546       0.41106534    0.4560018
    specificity  0.62579334  0.15424488   0.70454544    0.41463414    0.6           0.82978725    0.58
    
    See the whole table with table.as_data_frame()
    
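    If you want those fold-wise AUCs programmatically (e.g. to compute the variability yourself), the summary is an H2OTwoDimTable that converts to a pandas DataFrame. A minimal sketch, assuming pandas is installed and that the fold columns are named cv_1_valid ... cv_5_valid as shown above:

    # pull the per-fold AUCs out of the CV metrics summary
    cv_summary = leader.cross_validation_metrics_summary().as_data_frame()
    auc_row = cv_summary[cv_summary.iloc[:, 0] == "auc"]   # first column holds the metric names
    fold_cols = [c for c in cv_summary.columns if c.startswith("cv_")]
    fold_aucs = auc_row[fold_cols].astype(float).values.flatten()
    print("fold AUCs:", fold_aucs)
    print("mean: %.4f  sd: %.4f" % (fold_aucs.mean(), fold_aucs.std(ddof=1)))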

    Here's what the leaderboard looks like (storing aggregated CV AUCs). In this case, because the data is so small (300 rows), there's a noticeable difference between the two reported CV AUC values; for larger datasets, however, they should be much closer estimates.

    # print the whole Leaderboard (all CV metrics for all models)
    lb = aml.leaderboard
    print(lb)
    

    That will print the top of the leaderboard:

    model_id                                                  auc    logloss     aucpr    mean_per_class_error      rmse       mse
    ---------------------------------------------------  --------  ---------  --------  ----------------------  --------  --------
    XGBoost_grid__1_AutoML_20200924_200634_model_2       0.769716   0.565326  0.668827                0.290806  0.436652  0.190665
    GBM_grid__1_AutoML_20200924_200634_model_4           0.762993   0.56685   0.666984                0.279145  0.437634  0.191524
    XGBoost_grid__1_AutoML_20200924_200634_model_9       0.762417   0.570041  0.645664                0.300121  0.440255  0.193824
    GBM_grid__1_AutoML_20200924_200634_model_6           0.759912   0.572651  0.636713                0.30097   0.440755  0.194265
    StackedEnsemble_BestOfFamily_AutoML_20200924_200634  0.756486   0.574461  0.646087                0.294002  0.441413  0.194845
    GBM_grid__1_AutoML_20200924_200634_model_7           0.754153   0.576821  0.641462                0.286041  0.442533  0.195836
    XGBoost_1_AutoML_20200924_200634                     0.75411    0.584216  0.626074                0.289237  0.443911  0.197057
    XGBoost_grid__1_AutoML_20200924_200634_model_3       0.753347   0.57999   0.629876                0.312056  0.4428    0.196072
    GBM_grid__1_AutoML_20200924_200634_model_1           0.751706   0.577175  0.628564                0.273603  0.442751  0.196029
    XGBoost_grid__1_AutoML_20200924_200634_model_8       0.749446   0.576686  0.610544                0.27844   0.442314  0.195642
    
    [28 rows x 7 columns]
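
    To compare the two numbers side by side, you can read the aggregated value from the leaderboard and the fold-averaged value from the metrics summary. A rough sketch (it assumes the column labels shown above, which can vary slightly between H2O versions):

    # aggregated CV AUC, as reported on the leaderboard
    lb_df = aml.leaderboard.as_data_frame()
    agg_auc = lb_df.loc[lb_df["model_id"] == leader.model_id, "auc"].iloc[0]

    # fold-averaged CV AUC, as reported in the CV metrics summary
    cv_summary = leader.cross_validation_metrics_summary().as_data_frame()
    avg_auc = float(cv_summary.loc[cv_summary.iloc[:, 0] == "auc", "mean"].iloc[0])

    print("aggregated CV AUC: %.4f   fold-averaged CV AUC: %.4f" % (agg_auc, avg_auc))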