Tried on 28.0.2 and latest 30.0.1 versions.
Create first DRF:
rf1 <- h2o.randomForest(
  model_id = "first_drf1_x1",
  x = f2,
  y = r1,
  training_frame = train1,
  validation_frame = valid1,
  ntrees = 49,
  nfolds = 5,
  seed = 1
)
Train it, then try to continue training from this model like this:
rf2 <- h2o.randomForest(
  model_id = "second_drf1_x2",
  x = f2,
  y = r1,
  training_frame = train2,
  validation_frame = valid2,
  ntrees = 49 + 50,
  nfolds = 5,
  checkpoint = "first_drf1_x1",
  seed = 1
)
The logs immediately show this:
POST /3/ModelBuilders/drf, parms: {model_id=second_drf1_x2, validation_frame=RTMP_sid_aea1_16, response_column=pcs7_e_dt_4010u, training_frame=RTMP_sid_aea1_14, seed=1, nfolds=5, ntrees=99, ignored_columns=["ts","leve_batch_nbr"], checkpoint=first_drf1_x1}
04-30 10:20:34.601 127.0.0.1:54321 55804 FJ-1-5 INFO: Creating 5 cross-validation splits with random number seed: 1
04-30 10:20:34.612 127.0.0.1:54321 55804 FJ-1-5 ERRR: _weights_column: Weights column '__internal_cv_weights__' not found in the training frame
When the first model is built, 5 CV models are created, and they have that internal field set like this:
"_weights_column":"__internal_cv_weights__",
but when the main first model is trained, the field is null:
Building main model.
...
"_weights_column":null,
I've opened a bug in the H2O Jira, but maybe somebody has already seen this issue and has a workaround. If nfolds is set to 0 (disabling cross-validation), everything works just fine.
You would need to have nfolds disabled. As the docs say: "Cross-validation is not currently supported for checkpointing."
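A minimal sketch of that workaround, reusing the names from the question (f2, r1, train1, valid1, train2, valid2 and the two model IDs). The wrapper function train_then_continue is hypothetical and only there for illustration. It sets nfolds = 0 on both models, which is an assumption on my part: the question only reports that things work once cross-validation is disabled.

```r
# Hypothetical helper (not part of h2o): build a base DRF without
# cross-validation, then continue training it from a checkpoint.
# Assumes the h2o package is installed and h2o.init() has been called.
train_then_continue <- function(x, y, train1, valid1, train2, valid2,
                                base_trees = 49, extra_trees = 50) {
  # Base model: nfolds = 0, so no CV models and no internal CV weights column.
  rf1 <- h2o::h2o.randomForest(
    model_id = "first_drf1_x1",
    x = x,
    y = y,
    training_frame = train1,
    validation_frame = valid1,
    ntrees = base_trees,
    nfolds = 0,
    seed = 1
  )

  # Continued model: ntrees is the total tree count (old + new),
  # and cross-validation stays disabled.
  rf2 <- h2o::h2o.randomForest(
    model_id = "second_drf1_x2",
    x = x,
    y = y,
    training_frame = train2,
    validation_frame = valid2,
    ntrees = base_trees + extra_trees,
    nfolds = 0,
    checkpoint = "first_drf1_x1",
    seed = 1
  )

  list(base = rf1, continued = rf2)
}
```

With a running cluster this would be called as train_then_continue(f2, r1, train1, valid1, train2, valid2). If you still want a cross-validated estimate, one option is to train a separate model with nfolds = 5 and no checkpoint.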
If you are using new data, it may not make much sense to start from an old model for DRF. The original trees (1-49) won't benefit from the additional observations in the new data; only the trees built after the checkpoint (50-99) will see them. So roughly half your trees will lack some predictive information, which can bias your scoring.
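To put a number on "half your trees", here is the arithmetic for the tree counts in the question (49 original trees, 50 added after the checkpoint). The variable names are illustrative only, and the closing remark assumes the forest combines its trees with equal weight.

```r
# Tree counts taken from the question.
old_trees   <- 49           # built before the checkpoint, on the old data only
total_trees <- 49 + 50      # ntrees passed to the checkpointed model
new_trees   <- total_trees - old_trees

# Share of the final forest that never saw the new observations.
share_old <- old_trees / total_trees
print(round(share_old, 3))  # 0.495
```

Under that equal-weight assumption, about 49.5% of the final model's vote comes from trees that were fit without the new observations.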