I have a machine learning problem: 88 instances, 2 classes (40 instances of the "FR" class, 48 instances of the "RF" class). I tried several different algorithms myself and, evaluating the results with both cross-validation and leave-one-out, I could not get above 0.6 accuracy. Here is the link to the dataset in CSV format: https://drive.google.com/open?id=1lhCOP3Aywk4kGDEStAwL6Uq1H3twSJWS
Trying H2O AutoML with 10-fold cross-validation I reached more or less the same results: cross-validation-leaderboard. But when I tried leave-one-out I got unexpectedly much better results: leave-one-out-leaderboard
I performed the leave-one-out validation through the fold_column parameter by assigning each instance to a different fold; here is the code:
library(h2o)  # needed for h2o.automl() and as.h2o()
h2o.init()

train <- read.csv("training_set.csv", header = TRUE)

# Give each row its own fold id so that fold_column yields leave-one-out CV
train$ID <- seq.int(nrow(train))

# Identify predictors and response
y <- "class"
x <- setdiff(setdiff(names(train), y), "ID")

# For binary classification, the response should be a factor
train[, y] <- as.factor(train[, y])

# Run AutoML for 20 base models
aml <- h2o.automl(x = x, y = y,
                  fold_column = "ID",
                  keep_cross_validation_predictions = TRUE,
                  keep_cross_validation_fold_assignment = TRUE,
                  sort_metric = "logloss",
                  training_frame = as.h2o(train),
                  max_models = 20,
                  seed = 1)

# View the AutoML Leaderboard
lb <- aml@leaderboard
print(lb, n = nrow(lb))
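As a side note, the other way I tried to get leave-one-out was to request as many folds as there are rows instead of using fold_column. A minimal sketch of that variant (I am not sure the default, random fold assignment guarantees exactly one row per fold, which is why I mainly used the fold_column approach above):

# Alternative: ask for as many folds as rows instead of using fold_column.
# With the default (random) fold assignment the folds are not guaranteed to
# contain exactly one row each.
aml_nfolds <- h2o.automl(x = x, y = y,
                         training_frame = as.h2o(train),
                         nfolds = nrow(train),  # 88 folds for 88 instances
                         keep_cross_validation_predictions = TRUE,
                         sort_metric = "logloss",
                         max_models = 20,
                         seed = 1)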
First of all, I do not know whether either of these is the proper way to perform leave-one-out; the nfolds = 88 variant gave me more or less the same results. Here is the information found in aml@leader@model[["cross_validation_metrics"]]:
H2OBinomialMetrics: stackedensemble
** Reported on cross-validation data. **
** 88-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.1248958
RMSE: 0.353406
LogLoss: 0.4083967
Mean Per-Class Error: 0.075
AUC: 0.8635417
pr_auc: 0.7441933
Gini: 0.7270833
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FR RF Error Rate
FR 34 6 0.150000 =6/40
RF 0 48 0.000000 =0/48
Totals 34 54 0.068182 =6/88
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.712894 0.941176 53
2 max f2 0.712894 0.975610 53
3 max f0point5 0.712894 0.909091 53
4 max accuracy 0.712894 0.931818 53
5 max precision 0.712894 0.888889 53
6 max recall 0.712894 1.000000 53
7 max specificity 0.739201 0.975000 0
8 max absolute_mcc 0.712894 0.869227 53
9 max min_per_class_accuracy 0.715842 0.850000 46
10 max mean_per_class_accuracy 0.712894 0.925000 53
Although this information seems consistent, another thing that leads me to think something is wrong is the difference between the above confusion matrix and the one obtained by h2o.confusionMatrix(aml@leader):
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.117307738035598:
FR RF Error Rate
FR 18 22 0.550000 =22/40
RF 3 45 0.062500 =3/48
Totals 21 67 0.284091 =25/88
Why are the two confusion matrices different? Shouldn't they find the same F1-optimal threshold?
Is there something wrong, or is it just that the Stacked Ensemble is so much better?
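In case it is useful, this is how I would try to rebuild the first confusion matrix by hand from the combined holdout predictions (just a sketch: I am assuming the frame returned by h2o.cross_validation_holdout_predictions() has a probability column named after the "RF" class and that H2O predicts the positive class when its probability is at or above the threshold):

# Combined holdout predictions kept via keep_cross_validation_predictions = TRUE
holdout <- as.data.frame(h2o.cross_validation_holdout_predictions(aml@leader))

# Max-F1 threshold reported in aml@leader@model[["cross_validation_metrics"]]
thr <- 0.712894

# Assumed rule: predict "RF" when p(RF) >= threshold
pred_class <- ifelse(holdout$RF >= thr, "RF", "FR")

# Should reproduce the 88-fold confusion matrix above if my assumptions hold
table(actual = train$class, predicted = pred_class)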
For your question: "Why are the two confusion matrices different? Shouldn't they find the same F1-optimal threshold?"
Both confusion matrices use the max-F1 threshold. The difference may be which dataset is used for calculating F1. You can see the threshold on the first row of the table "Maximum Metrics: Maximum metrics at their respective thresholds."
aml@leader@model[["cross_validation_metrics"]] appears to be computed on the cross-validation (holdout) data, while h2o.confusionMatrix(aml@leader) uses the training data. You can check aml@leader@model[["training_metrics"]] to see whether it matches h2o.confusionMatrix(aml@leader).
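For example, something along these lines should make the comparison explicit (a sketch; I'm assuming h2o.performance() with the train/xval flags and h2o.find_threshold_by_max_metric() work on the AutoML leader as they do on ordinary models):

# Metrics computed on the combined CV holdout predictions vs. on the training data
perf_xval  <- h2o.performance(aml@leader, xval  = TRUE)
perf_train <- h2o.performance(aml@leader, train = TRUE)

# Max-F1 threshold chosen on each set of predictions
h2o.find_threshold_by_max_metric(perf_xval,  "f1")
h2o.find_threshold_by_max_metric(perf_train, "f1")

# h2o.confusionMatrix(aml@leader) defaults to the training metrics,
# so the second matrix here should match it
h2o.confusionMatrix(perf_xval)
h2o.confusionMatrix(perf_train)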