Seeing the error
TypeError: unsupported operand type(s) for +: 'NoneType' and 'unicode'
when trying to use grid search to train a model in h2o, and I am unable to interpret the cause.
Here is the output that gets printed right before the error:
drf Grid Build progress: |████████████████████████████████████████████████| 100%
Errors/Warnings building gridsearch model
Hyper-parameter: col_sample_rate_per_tree, 0.75
Hyper-parameter: max_depth, 5
Hyper-parameter: min_rows, 4096.0
Hyper-parameter: min_split_improvement, 1e-08
Hyper-parameter: mtries, 8
Hyper-parameter: nbins, 8
Hyper-parameter: nbins_cats, 64
Hyper-parameter: ntrees, 96
Hyper-parameter: sample_rate, 0.6320000291
failure_details: None
failure_stack_traces: java.lang.NullPointerException
at hex.tree.SharedTree.init(SharedTree.java:164)
at hex.tree.drf.DRF.init(DRF.java:53)
at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:207)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:222)
at hex.ModelBuilder.trainModelNested(ModelBuilder.java:348)
at hex.ModelBuilder$TrainModelNestedRunnable.run(ModelBuilder.java:383)
at water.H2O.runOnH2ONode(H2O.java:1304)
at water.H2O.runOnH2ONode(H2O.java:1297)
at hex.ModelBuilder.trainModelNested(ModelBuilder.java:364)
at hex.grid.GridSearch.buildModel(GridSearch.java:343)
at hex.grid.GridSearch.gridSearch(GridSearch.java:220)
at hex.grid.GridSearch.access$000(GridSearch.java:71)
at hex.grid.GridSearch$1.compute2(GridSearch.java:138)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1416)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
and here is the code being used to create the grid search object:
model = h2o.h2o.H2ORandomForestEstimator(
    response_column=configs['RESPONSE'],
    keep_cross_validation_models=False,
    keep_cross_validation_predictions=False
)
random_forest_grid = h2o.h2o.H2OGridSearch(
    model=model,
    hyper_params=configs['HYPERPARAMETER_RANGES'],
    search_criteria=configs['SEARCH_CRITERIA']
)
.
.
.
max_train_time_hrs = 8
# here is where the ERROR is thrown
random_forest_grid.train(x=training_features, y=training_response,
                         weights_column='weight',
                         training_frame=train_u, validation_frame=test_u,
                         max_runtime_secs=max_train_time_hrs * 60 * 60)
where the configs dictionary being referred to looks like...
configs = {
    .
    .
    .
    'HYPERPARAMETER_RANGES': {
        'ntrees': [32, 64, 96, 128],  # default is 50
        'nbins_cats': [16, 32, 64, 128, 512, 1024],  # default is 1024
        'nbins': [8, 13, 21, 34],  # default is 20
        'max_depth': [5, 8, 13],  # default is 20
        'mtries': [-1, 5, 8, 13],  # default is -1 (square root of the number of features)
        'min_split_improvement': [1 * 10 ** -8,
                                  1 * 10 ** -5,
                                  1 * 10 ** -3],
        'min_rows': [16, 64, 256, 1024, 4096],  # minimum number of observations for a leaf
        'col_sample_rate_per_tree': [0.75, 0.9, 1],  # default is 1
        'sample_rate': [0.5, 0.6320000291, 0.75]  # default is 0.6320000291
    },
    'SEARCH_CRITERIA': {
        'strategy': 'RandomDiscrete',
        'max_models': 24,
        'seed': 64,
        'stopping_metric': 'AUTO',  # AUTO defaults to log-loss for classification
    }
}
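For reference, the classes used above typically come from the following h2o-py modules (just a sketch; the exact module paths may differ slightly between h2o versions):

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch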
Note that the grid search works for some other DRF models I am training (with the exact same hyper-parameter and search-criteria ranges), and I can't find any notable difference between those working versions and this erroring one. Are there any common reasons why this kind of error may be thrown in h2o? Any theories or further debugging suggestions would be appreciated.
Found the cause of the error by checking the logs in the h2o Flow UI, which I would say is a good h2o debugging tip in general, since it appears some errors only print there and not to the standard error output.
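If the Flow UI isn't convenient, the node logs can also be pulled down from the Python client; a minimal sketch (assuming the client is already connected to a running cluster):

import h2o

# download a zip of all node logs into the current directory,
# then search it for ERRR lines like the ones shown below
log_path = h2o.download_all_logs(dirname='.')
print(log_path)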
06-20 12:39:02.188 172.18.4.64:54321 27694 FJ-1-11 INFO: Building H2O DRF model with these parameters:
06-20 12:39:02.188 172.18.4.64:54321 27694 FJ-1-11 INFO: {"_train":{"name":"py_9_sid_827e","type":"Key"},"_valid":{"name":"py_10_sid_827e","type":"Key"},"_nfolds":0,"_keep_cross_validation_models":false,"_keep_cross_validation_predictions":false,"_keep_cross_validation_fold_assignment":false,"_parallelize_cross_validation":true,"_auto_rebalance":true,"_seed":111,"_fold_assignment":"AUTO","_categorical_encoding":"AUTO","_max_categorical_levels":10,"_distribution":"AUTO","_tweedie_power":1.5,"_quantile_alpha":0.5,"_huber_alpha":0.9,"_ignored_columns":null,"_ignore_const_cols":true,"_weights_column":"weight","_offset_column":null,"_fold_column":null,"_check_constant_response":true,"_is_cv_model":false,"_score_each_iteration":false,"_max_runtime_secs":28800.0,"_stopping_rounds":0,"_stopping_metric":"AUTO","_stopping_tolerance":0.001,"_response_column":"DENIAL","_balance_classes":false,"_max_after_balance_size":5.0,"_class_sampling_factors":null,"_max_confusion_matrix_size":20,"_checkpoint":null,"_pretrained_autoencoder":null,"_custom_metric_func":null,"_export_checkpoints_dir":null,"_ntrees":96,"_max_depth":13,"_min_rows":64.0,"_nbins":13,"_nbins_cats":16,"_min_split_improvement":1.0E-5,"_histogram_type":"AUTO","_r2_stopping":1.7976931348623157E308,"_nbins_top_level":1024,"_build_tree_one_node":false,"_score_tree_interval":0,"_initial_score_interval":4000,"_score_interval":4000,"_sample_rate":0.6320000291,"_sample_rate_per_class":null,"_calibrate_model":false,"_calibration_frame":null,"_col_sample_rate_change_per_level":1.0,"_col_sample_rate_per_tree":1.0,"_binomial_double_trees":true,"_mtries":5}
06-20 12:39:02.189 172.18.4.64:54321 27694 FJ-1-11 ERRR: _weights_column: Weights column 'weight' not found in the training frame
06-20 12:39:02.189 172.18.4.64:54321 27694 FJ-1-11 ERRR: _weights_column: Weights column 'weight' not found in the training frame
It turns out the problem was that the column assigned as the weights_column param in the grid search was not actually present in the H2OFrame being used.
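A simple pre-flight check like the sketch below would have caught this before the grid even started training (it just reuses the train_u / test_u frames and the 'weight' column name from the code above):

# fail fast if the weights column is missing from either H2OFrame
for name, frame in [('train_u', train_u), ('test_u', test_u)]:
    if 'weight' not in frame.columns:
        raise ValueError("weights_column 'weight' not found in %s" % name)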
I will try to pare down the question post to be more relevant to others who may find this problem based only on the title (since the standard error printed in the console gives no indication of the specific problem).