H2O AutoML - how to provide weights


Here is my example with the Default data set from the ISLR package. The data is imbalanced, so I rebalance it with class weights and run H2O AutoML with GBMs only.

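library(ISLR)
library(h2o)
library(magrittr)
library(dplyr)
library(parallel)  # provides detectCores()
core_count <- detectCores()
h2o.init(nthreads = (core_count - 1))

my_df <- Default
# Predictors: every column except the response (the weights column is added later)
x <- setdiff(colnames(my_df), 'default')
y <- 'default'

# Class weights: 60% of the total weight goes to 'No', 40% to 'Yes'
my_df %<>% mutate(weights = if_else(default == 'No',
                                    0.6 / table(my_df$default)[[1]],
                                    0.4 / table(my_df$default)[[2]]))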

aml_test <- h2o.automl(x = x, y = y,
                  training_frame = as.h2o(my_df[1:8000, ]),
                  validation_frame = as.h2o(my_df[8001:10000, ]),
                  nfolds = 0, 
                  weights_column = "weights",
                  include_algos = c('GBM'),
                  seed = 12345,
                  max_runtime_secs = 1200)

It generates the following errors:

    09:46:49.611: Skipping training of model GBM_1_AutoML_20210821_094649 due to exception:
    water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model:
    GBM_1_AutoML_20210821_094649.  Details: ERRR on field: _min_rows: The dataset size is too
    small to split for min_rows=1.0: must have at least 2.0 (weighted) rows, but have only
    0.7172904568994339.

    09:46:49.622: Skipping training of model GBM_2_AutoML_20210821_094649 due to exception:
    water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model:
    GBM_2_AutoML_20210821_094649.  Details: ERRR on field: _min_rows: The dataset size is too
    small to split for min_rows=10.0: must have at least 20.0 (weighted) rows, but have only
    0.7172904568994339.

    09:46:49.630: Skipping training of model GBM_3_AutoML_20210821_094649 due to exception:
    water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model:
    GBM_3_AutoML_20210821_094649.  Details: ERRR on field: _min_rows: The dataset size is too
    small to split for min_rows=10.0: must have at least 20.0 (weighted) rows, but have only
    0.7172904568994339.

    09:46:49.637: Skipping training of model GBM_4_AutoML_20210821_094649 due to exception:
    water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model:
    GBM_4_AutoML_20210821_094649.  Details: ERRR on field: _min_rows: The dataset size is too
    small to split for min_rows=10.0: must have at least 20.0 (weighted) rows, but have only
    0.7172904568994339.

    09:46:49.644: Skipping training of model GBM_5_AutoML_20210821_094649 due to exception:
    water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model:
    GBM_5_AutoML_20210821_094649.  Details: ERRR on field: _min_rows: The dataset size is too
    small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only
    0.7172904568994339.

      |===================================================================================| 100%

    09:49:50.241: Empty leaderboard.

    AutoML was not able to build any model within a max runtime constraint of 1200 seconds,
    you may want to increase this value before retrying.The leaderboard contains zero models:
    try running AutoML for longer (the default is 1 hour).

Essentially, AutoML fails to build any GBM whenever class weights are provided; it works fine without weights. The run did not even last the full 20 minutes, and no models were generated.


Solution

  • The error message in your output points to the cause:

    Details: ERRR on field: _min_rows: The dataset size is too 
    small to split for min_rows=10.0: must have at least 20.0 (weighted) rows, but have only 
    0.7xxxx.
    

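    It seems like you need to increase your weight values and/or the number of rows. As the "(weighted) rows" wording indicates, the min_rows check is made against the sum of the weights rather than the raw row count, and your weights sum to 1 over the whole data set, so the training frame carries a total weight of only about 0.72. Try multiplying your weights column by 10x or 100x (or more) and see if it helps. I suspect this wouldn't be an issue if you set the weights column to all ones.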
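
    One way to pick the multiplier is to rescale the weights so they sum to the number of rows, which gives a mean weight of 1 and keeps the 60/40 class split. The following is a minimal sketch in base R, not a tested fix for the AutoML run itself. It assumes the class counts of Default are 9667 'No' and 333 'Yes' and rebuilds the response from those counts so it runs without ISLR or h2o; the names w_raw and w_scaled are made up for illustration.

```r
# Assumed class counts of ISLR::Default (9667 "No", 333 "Yes"),
# rebuilt here so the sketch runs without loading ISLR.
default <- factor(c(rep("No", 9667), rep("Yes", 333)), levels = c("No", "Yes"))
counts <- table(default)

# Weights as defined in the question: they sum to 1 over the whole data set.
w_raw <- ifelse(default == "No", 0.6 / counts[["No"]], 0.4 / counts[["Yes"]])

# Rescale so the weights sum to the number of rows (mean weight of 1).
w_scaled <- w_raw * length(default) / sum(w_raw)

# Total weight before and after rescaling.
total_raw <- sum(w_raw)
total_scaled <- sum(w_scaled)

# Share of the total weight held by the "No" class after rescaling.
share_no <- sum(w_scaled[default == "No"]) / total_scaled

print(c(total_raw = total_raw, total_scaled = total_scaled, share_no = share_no))
```

    In the question's code this would amount to multiplying the weights column by nrow(my_df) before calling as.h2o(). A plain 100x multiplier would lift the training total to about 72, which clears the checks for min_rows=1 and min_rows=10 but still falls short of the 200 weighted rows required for min_rows=100.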