R h2o - confusion matrix on cross-validation for mcc threshold


After training my XGBoost model with 5-fold cross-validation, I would like to get an idea of how the model performs on new data. As far as I understand, the performance of the model on the cross-validation holdout folds is an acceptable measure of this.

Using h2o.performance(best_XGBoost, xval = TRUE) I can get the cross-validation confusion matrix. However, its threshold is the one that maximizes F1, and I would like to see the performance at the threshold that maximizes absolute_mcc instead.

Is there a way to do it?


Solution

    1. Performance on new data:

    h2o.confusionMatrix(object = yourXGBmodelHere,
                        newdata = yourTestSetHere,
                        metrics = "absolute_mcc")
    

    2. CV performance assessment:

    fold_ass <- h2o.cross_validation_fold_assignment(model)
    cvTrain <- h2o.cbind(data.train, fold_ass)
    

    Example: how the first cross-validation model performs on its own holdout fold (fold assignments are 0-indexed, so this is fold 0):

    h2o.confusionMatrix(object=h2o.cross_validation_models(model)[[1]], 
                        newdata=cvTrain[fold_ass == 0, ], 
                        metrics = "absolute_mcc")
    

    NB - this assumes that the model was trained with keep_cross_validation_fold_assignment = TRUE and keep_cross_validation_predictions = TRUE (and with keep_cross_validation_models = TRUE, which is the default), so that you can use:

    h2o.cross_validation_fold_assignment(model)
    h2o.cross_validation_predictions(model)
    h2o.cross_validation_models(model)
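
The two approaches above can be wrapped into small helpers. This is a minimal sketch, not a definitive implementation: the function names cv_confusion_mcc and fold_confusion_mcc are hypothetical, and it assumes a binomial model trained with nfolds (so fold ids run from 0 to k - 1 and the i-th cross-validation model holds out fold i - 1) and with the keep_cross_validation_* options mentioned in the NB. The first helper answers the original question directly by taking the cross-validation metrics object and asking for the confusion matrix at the max absolute_mcc threshold instead of the default F1 one.

```r
# Sketch only: both helpers assume the h2o package is installed and a cluster
# is running (h2o::h2o.init()). Nothing is executed until they are called.

# Confusion matrix of the cross-validation metrics, evaluated at the
# threshold that maximizes absolute MCC rather than F1.
cv_confusion_mcc <- function(model) {
  perf <- h2o::h2o.performance(model, xval = TRUE)
  h2o::h2o.confusionMatrix(perf, metrics = "absolute_mcc")
}

# One confusion matrix per fold: each cross-validation model is scored on
# the rows it did not see during training.
# Assumption: the i-th CV model holds out fold id i - 1 (0-indexed folds).
fold_confusion_mcc <- function(model, train) {
  fold_ass  <- h2o::h2o.cross_validation_fold_assignment(model)
  cv_models <- h2o::h2o.cross_validation_models(model)
  lapply(seq_along(cv_models), function(i) {
    holdout <- train[fold_ass == (i - 1), ]
    h2o::h2o.confusionMatrix(cv_models[[i]],
                             newdata = holdout,
                             metrics = "absolute_mcc")
  })
}
```

Usage would look like cv_confusion_mcc(best_XGBoost) and fold_confusion_mcc(best_XGBoost, data.train). Note that each per-fold matrix picks its own MCC-maximizing threshold on that fold's holdout rows, so the per-fold numbers are somewhat optimistic compared with applying one fixed threshold to unseen data.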