Tags: r, machine-learning, cross-validation, h2o, automl

H2O AutoML leave-one-out performs much better than 10-fold cross-validation


I have a machine learning problem: 88 instances, 2 classes (40 instances of the "FR" class, 48 of the "RF" class). I tried several different algorithms myself and, evaluating the results with both cross-validation and leave-one-out, could not reach more than 0.6 accuracy. Here is a link to the dataset in CSV format: https://drive.google.com/open?id=1lhCOP3Aywk4kGDEStAwL6Uq1H3twSJWS

Trying H2O AutoML with 10-fold cross-validation, I reached more or less the same results: cross-validation-leaderboard. But when I tried leave-one-out, I got unexpectedly much better results: leave-one-out-leaderboard

I performed the leave-one-out validation through the fold_column parameter, assigning each instance to a different fold; here is the code:

library(h2o)
h2o.init()

# Load the data and add an ID column (one distinct value per row,
# later used as the fold column to get leave-one-out folds)
train <- read.csv("training_set.csv", header = TRUE)
train$ID <- seq.int(nrow(train))

# Identify predictors and response
y <- "class"
x <- setdiff(setdiff(names(train), y), "ID")

# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])

# Run AutoML for 20 base models 
aml <- h2o.automl(x = x, y = y,
                  fold_column = "ID",
                  keep_cross_validation_predictions = TRUE,
                  keep_cross_validation_fold_assignment = TRUE,
                  sort_metric = "logloss",
                  training_frame = as.h2o(train),
                  max_models = 20,
                  seed = 1)

# View the AutoML Leaderboard
lb <- aml@leaderboard
print(lb, n = nrow(lb))
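
For completeness, here is roughly what the nfolds-based variant I also tried looked like (a sketch rather than the exact code; it reuses the frame and predictors defined above and lets H2O assign the folds):

# Alternative: approximate leave-one-out via nfolds instead of a fold column
aml_nfolds <- h2o.automl(x = x, y = y,
                         nfolds = nrow(train),   # 88 folds to approximate leave-one-out
                         keep_cross_validation_predictions = TRUE,
                         sort_metric = "logloss",
                         training_frame = as.h2o(train),
                         max_models = 20,
                         seed = 1)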

First of all, I do not know whether this is the proper way to perform leave-one-out; I also tried setting nfolds to 88 (as sketched above) and got more or less the same results. Here is the information found in aml@leader@model[["cross_validation_metrics"]]:

H2OBinomialMetrics: stackedensemble
** Reported on cross-validation data. **
** 88-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.1248958
RMSE:  0.353406
LogLoss:  0.4083967
Mean Per-Class Error:  0.075
AUC:  0.8635417
pr_auc:  0.7441933
Gini:  0.7270833

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       FR RF    Error   Rate
FR     34  6 0.150000  =6/40
RF      0 48 0.000000  =0/48
Totals 34 54 0.068182  =6/88

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.712894 0.941176  53
2                       max f2  0.712894 0.975610  53
3                 max f0point5  0.712894 0.909091  53
4                 max accuracy  0.712894 0.931818  53
5                max precision  0.712894 0.888889  53
6                   max recall  0.712894 1.000000  53
7              max specificity  0.739201 0.975000   0
8             max absolute_mcc  0.712894 0.869227  53
9   max min_per_class_accuracy  0.715842 0.850000  46
10 max mean_per_class_accuracy  0.712894 0.925000  53

Although this information seems consistent, another thing that leads me to think something is wrong is the difference between the above confusion matrix and the one obtained with h2o.confusionMatrix(aml@leader):

Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.117307738035598:
       FR RF    Error    Rate
FR     18 22 0.550000  =22/40
RF      3 45 0.062500   =3/48
Totals 21 67 0.284091  =25/88

Why are the two confusion matrices different? Shouldn't they find the same F1-optimal threshold?

Is there something wrong, or is it just that the Stacked Ensemble really is that much better?


Solution

    • With just 88 instances of data, there is a risk of overfitting. To make sure you are not overfitting, take a sample of your data as a holdout/test set (which the model/training never sees), then use the rest for training and cross-validation. You can then use the holdout data to check whether the model performs similarly to what you found during validation, and whether LOO really is much better.
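
    A minimal sketch of such a split, reusing x, y and train from the question (the 80/20 ratio and object names are illustrative, not prescribed):

    # Hold out ~20% of the rows that the model/training never sees (illustrative ratio)
    splits    <- h2o.splitFrame(as.h2o(train), ratios = 0.8, seed = 1)
    train_h2o <- splits[[1]]
    test_h2o  <- splits[[2]]

    # Train with cross-validation on the remaining 80% only
    aml_cv <- h2o.automl(x = x, y = y,
                         training_frame = train_h2o,
                         nfolds = 10,
                         sort_metric = "logloss",
                         max_models = 20,
                         seed = 1)

    # Compare cross-validation metrics with performance on the untouched holdout
    h2o.performance(aml_cv@leader, xval = TRUE)
    h2o.performance(aml_cv@leader, newdata = test_h2o)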

    For your question: Why are the two confusion matrices different? Shouldn't they find the same F1-optimal threshold?

    • Both confusion matrices use the max-F1 threshold. The difference may be in which dataset is used to calculate F1. You can see the threshold in the first row of the table "Maximum Metrics: Maximum metrics at their respective thresholds."

    • aml@leader@model[["cross_validation_metrics"]] appears to use the cross-validation (holdout) data, while h2o.confusionMatrix(aml@leader) uses the training data. You can try aml@leader@model[["training_metrics"]] to see whether it matches h2o.confusionMatrix(aml@leader).
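
    A sketch of how to pull both sets of metrics explicitly and compare their max-F1 thresholds (this assumes the usual h2o.performance accessors on the leader model):

    # Confusion matrix from the combined cross-validation holdout predictions
    cv_perf <- h2o.performance(aml@leader, xval = TRUE)
    h2o.confusionMatrix(cv_perf)

    # Confusion matrix from the training data, which is what
    # h2o.confusionMatrix(aml@leader) reports by default
    train_perf <- h2o.performance(aml@leader, train = TRUE)
    h2o.confusionMatrix(train_perf)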