How Do I Perform Hyperparameter Optimization for a Non-Toy Dataset in R Using mlr3hyperband?


I have a dataset, let's call it "train.csv",

library(readr)
train = na.omit(read_csv('train.csv'))

that I want to use to train an XGBoost predictive model. In the example from the mlr3hyperband documentation, hyperparameter optimization is set up as follows:

library(mlr3hyperband)
library(mlr3learners)

learner = lrn("classif.xgboost",
  nrounds           = to_tune(p_int(27, 243, tags = "budget")),
  eta               = to_tune(1e-4, 1, logscale = TRUE),
  max_depth         = to_tune(1, 20),
  colsample_bytree  = to_tune(1e-1, 1),
  colsample_bylevel = to_tune(1e-1, 1),
  lambda            = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha             = to_tune(1e-3, 1e3, logscale = TRUE),
  subsample         = to_tune(1e-1, 1)
)

instance = tune(
  tnr("hyperband", eta = 3),
  task = tsk("pima"), # This is the point of challenge.
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce")
)

instance$result

However, the "task" argument in the tune() call refers to a toy dataset, the pima dataset. I want to tune the model on train.csv instead, but I'm not sure how to go about it. I've tried removing the task argument entirely, but tune() needs it to run. I've also tried passing the data frame itself as the task, but that doesn't work either.

# None of the below work.
task = tsk(train)
task = train

Solution

  • According to the mlr3book, you need to construct your own task:

    We can also load the data separately and convert it to a task, without using the tsk() function that mlr3 provides. If the data we want to use does not come with mlr3, it has to be done this way.

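    "This way" means using a converter such as as_task_regr() or as_task_classif() (see section 2.1.1, Constructing Tasks). Since the learner in the question is classif.xgboost, as_task_classif() is the one that applies.

    Disclaimer: I have no hands-on mlr3 experience.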
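
    A minimal sketch of the conversion, under stated assumptions: the small data frame below is a hypothetical stand-in for the contents of train.csv, and the target column name "label" is invented for illustration (the question does not say what the target column is called). It shows only how a plain data frame becomes a classification task.

```r
library(mlr3)

# Hypothetical stand-in for `train`; in the question this would be
# na.omit(read_csv("train.csv")).
set.seed(1)
train = data.frame(
  x1    = rnorm(100),
  x2    = rnorm(100),
  # Assumed target column; stored as a factor for classification.
  label = factor(sample(c("neg", "pos"), 100, replace = TRUE))
)

# Convert the data frame into a classification task.
# `target` names the column to predict; `id` is just a label for the task.
task = as_task_classif(train, target = "label", id = "train")

print(task)
```

    The resulting task object would then take the place of tsk("pima") in the tune() call, i.e. task = task. If the real target column is read in as text or numbers, converting it with as.factor() first is probably the safest route.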