machine-learning, h2o, medical

H2O stacked ensemble with models using different inputs


Using H2O Flow, is there a way to create a stacked ensemble from individual models that do not take the same inputs but predict the same response labels?

E.g., I am trying to predict miscoded healthcare claims (i.e., charges) and would like to train models for a stacked ensemble of the form:

model1(diagnosis1, diagnosis2, ..., diagnosis5) -> denied or paid (by insurer)
model2(procedure, procedure_detail1, ..., procedure_detail5) -> denied or paid 
model3(service_date, insurance_amount, insurer_id) -> (same)
model4(pat_age, pat_sex, ...) -> (same)
...

Is there a way to do this in H2O Flow? (I can't tell how from what the Flow GUI presents for stacked ensembles.) Is this even a sensible way to go about it, or is it confused in some way? (I'm relatively new to machine learning.) Thanks.


Solution

  • Darren's response that you can't do this in H2O was correct until very recently. H2O just removed the requirement that the base models be trained on the same set of inputs, since the Stacked Ensemble algorithm does not actually require it. The change is only available in the nightly releases built off master, though, so even on the latest stable release you would see an error like this (in Flow, R, Python, etc.) if you tried to use models that don't use exactly the same columns:

    Error: water.exceptions.H2OIllegalArgumentException: Base models are inconsistent: they use different column lists.  Found: [x6, x7, x4, x5, x2, x3, x1, x9, x8, x10, response] and: [x10, x16, x15, x18, x17, x12, x11, x14, x13, x19, x9, x8, x20, x21, x28, x27, x26, x25, x24, x23, x22, x6, x7, x4, x5, x2, x3, x1, response].  
    

    The metalearning step in the Stacked Ensemble algorithm combines the outputs of the base models, so which inputs (and how many) went into training each base model doesn't really matter. Currently, H2O still requires that the inputs all be part of the same original training_frame, but you can use a different x for each base model if you like (the x argument specifies which columns of the training_frame you want to use in your model).
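
    To make the metalearning point concrete, here is a minimal base-R sketch of stacking. It is not H2O's implementation, only an illustration under stated assumptions: the data are simulated, the column names x1..x4 and the model names m1 and m2 are hypothetical, and plain logistic regressions stand in for both the base models and the metalearner.

```r
# Illustrative sketch only (base R, no H2O): shows why base models
# may use different input columns in a stacked ensemble.
set.seed(1)

# Simulated binary-outcome data with hypothetical columns x1..x4
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x1 - dat$x2 + 0.5 * dat$x3 + 0.5 * dat$x4))

# Each base model sees a different subset of the columns
feature_sets <- list(m1 = c("x1", "x2"), m2 = c("x3", "x4"))

# One shared fold assignment, so all base models are
# cross-validated on the same splits of the same rows
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Level-one data: one column of out-of-fold predictions per base model
level_one <- sapply(feature_sets, function(cols) {
  preds <- numeric(n)
  for (i in seq_len(k)) {
    held_out <- fold == i
    fit <- glm(reformulate(cols, response = "y"),
               family = binomial, data = dat[!held_out, ])
    preds[held_out] <- predict(fit, newdata = dat[held_out, ],
                               type = "response")
  }
  preds
})

# The metalearner is trained only on the base-model predictions,
# never on the original input columns
meta_data <- data.frame(level_one, y = dat$y)
metalearner <- glm(y ~ ., family = binomial, data = meta_data)

dim(level_one)
names(coef(metalearner))
```

    The metalearner only ever sees one prediction column per base model (m1 and m2 here), so the base models need to agree on the rows and the response, not on the input columns.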

    The way that Stacked Ensemble works in Flow is that it looks for models that are all "compatible", in other words, trained on the same data frame. You then select from this list the models you want to include in the ensemble. So as long as you are using the latest development version of H2O, this is how to do what you want to do in Flow.

    [Screenshot: selecting ensemble base models in H2O Flow]

    Here's an R example of how to ensemble models that are trained on different subsets of the feature space:

    library(h2o)
    h2o.init()
    
    # Import a sample binary outcome training set into H2O
    train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
    test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
    
    # Identify predictors and response
    y <- "response"
    x <- setdiff(names(train), y)
    
    # For binary classification, response should be a factor
    train[,y] <- as.factor(train[,y])
    test[,y] <- as.factor(test[,y])
    
    # Train & Cross-validate a GBM using a subset of features
    my_gbm <- h2o.gbm(x = x[1:10],
                      y = y,
                      training_frame = train,
                      distribution = "bernoulli",
                      nfolds = 5,
                      keep_cross_validation_predictions = TRUE,
                      seed = 1)
    
    # Train & Cross-validate a RF using a subset of features
    my_rf <- h2o.randomForest(x = x[3:15],
                              y = y,
                              training_frame = train,
                              nfolds = 5,
                              keep_cross_validation_predictions = TRUE,
                              seed = 1)
    
    # Train a stacked ensemble using the GBM and RF above
    ensemble <- h2o.stackedEnsemble(y = y, training_frame = train,
                                    base_models = list(my_gbm, my_rf))
    
    # Check out ensemble performance
    perf <- h2o.performance(ensemble, newdata = test)
    h2o.auc(perf)