
Tuning hyperparameters in mlr does not produce sensible results?


I am trying to tune hyperparameters in mlr using the tuneParams function, but I can't make sense of the results it is giving me (or else I'm using it incorrectly).

For example, if I create some data with a binary response, fit an mlr h2o classification model, and check the accuracy and AUC, I get some values. If I then use tuneParams on some parameters, find a better accuracy and AUC, and plug the tuned values into my model, the resulting accuracy and AUC (for the model) do not match those reported by tuneParams.

Hopefully the code below will illustrate my issue:

library(mlr)

# Create data
set.seed(1234)
Species <- sample(c("yes", "no"), size = 150, replace = T)

dat <- data.frame(
  x1 = (Species == "yes") + rnorm(150),
  x2 = (Species == "no") + rnorm(150), Species
)

# split into training and test
train <- sample(nrow(dat), round(.7*nrow(dat))) # split 70-30
datTrain <- dat[train, ]
datTest <- dat[-train, ]

# create mlr h2o model
task <- makeClassifTask(data = dat, target = "Species")
learner <- makeLearner("classif.h2o.deeplearning", predict.type = "prob", 
                       par.vals = list(reproducible = TRUE,
                                       seed = 1))
Mod <- train(learner, task)

# Test predictions
pred <- predict(Mod, newdata = datTest)
# Evaluate performance accuracy & area under curve 
performance(pred, measures = list(acc, auc)) 

The result of the above performance check is:

acc       auc 
0.7111111 0.7813765 

Now, if I tune just one of the parameters (e.g., epochs):

set.seed(1234)
# Tune epoch parameter
param_set <- makeParamSet(
  makeNumericParam("epochs", lower = 1, upper = 10))
rdesc <- makeResampleDesc("CV", iters = 3L, predict = "both") 
ctrl <- makeTuneControlRandom(maxit = 3)

res <- tuneParams(
  learner = learner, task = task, resampling = rdesc, measures = list(auc, acc),
  par.set = param_set, control = ctrl
)

The result I get from tuning epochs is:

Tune result:
Op. pars: epochs=1.95
auc.test.mean=0.8526496,acc.test.mean=0.7466667

Now, if I plug that epochs value into the learner, retrain the model, and check the performance again:

set.seed(1234)
# plugging the tuned value into model and checking performance again:
learner <- makeLearner("classif.h2o.deeplearning", predict.type = "prob", 
                       par.vals = list(epochs = 1.95,
                                       reproducible = TRUE,
                                       seed = 1))
Mod <- train(learner, task)

# Test predictions
pred1 <- predict(Mod, newdata = datTest)
# Evaluate performance accuracy & area under curve 
performance(pred1, measures = list(acc, auc))

The resulting accuracy and AUC I get is now:

   acc       auc 
0.6666667 0.8036437 

My question is: why is there such a difference between the accuracy and AUC reported by tuneParams and the values I get when I plug the tuned parameters into the learner? Or am I using or interpreting tuneParams incorrectly?


Solution

  • You're getting different results because you're evaluating the learner on different train and test data. tuneParams reports 3-fold cross-validated performance over the whole task, whereas your manual check trains on the full task (all of dat) and then predicts on datTest, which is itself a subset of that training data. If I use the same 3-fold CV, I get the same results:

    set.seed(1234)
    resample(learner, task, cv3, list(auc, acc))
    
    Aggr perf: auc.test.mean=0.8526496,acc.test.mean=0.7466667
    

    In general, every computed performance value is only an estimate of the true generalization performance, and it will vary depending on the resampling method you choose and the data it is applied to. A sketch of a like-for-like comparison is below.
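
    As a minimal sketch (assuming the objects from the question — learner, task, res — are still in the workspace), you can plug the tuned values into the learner with setHyperPars() instead of retyping them, and then evaluate with the same resampling description that tuneParams used, so the numbers are directly comparable. Evaluating the very same learner with a different scheme (e.g. a single holdout) will usually give a different estimate of the same underlying generalization performance.

    # Apply the tuned hyperparameters from the TuneResult
    tunedLearner <- setHyperPars(learner, par.vals = res$x)
    
    # Evaluate with the same 3-fold CV that tuneParams used
    set.seed(1234)
    resample(tunedLearner, task, cv3, measures = list(auc, acc))
    
    # The same learner under a different resampling scheme (70/30 holdout)
    # will typically produce different performance estimates
    resample(tunedLearner, task, makeResampleDesc("Holdout", split = 0.7),
             measures = list(auc, acc))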