
mlr3 PipeOps: Create branches with different data transformations and benchmark different learners within and between branches


I'd like to use PipeOps to train a learner on three alternative transformations of a dataset:

  1. No transformation.
  2. Class balancing: down-sampling.
  3. Class balancing: up-sampling.

Then, I'd like to benchmark the three learned models.

My idea was to set up the pipeline as follows:

  1. Make pipeline: Input -> Impute dataset (optional) -> Branch -> Split into the three branches described above -> Add the learner within each branch -> Unbranch.
  2. Train the pipeline and hope (that's where I'm getting it wrong) that there will be a result saved for each learner within each branch.

Unfortunately, following these steps results in a single learner that seems to have 'merged' everything from the different branches. I was hoping to get a list of length 3, but I get a list of length one instead.

R code:

library(data.table)
library(paradox)
library(mlr3)
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3viz)

learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
  cp = 0,
  maxdepth = 21,
  minbucket = 12,
  minsplit = 24
)

graph = 
  po("imputehist") %>>%
  po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
  gunion(list(
    po("nop", id = "null"),
    po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor'), 
    po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major')
  )) %>>%
  gunion(list(
    po("learner", learner, id = "learner_null"),
    po("learner", learner, id = "learner_classbalancing_down"),
    po("learner", learner, id = "learner_classbalancing_up")
  )) %>>%
  po("unbranch")

plot(graph)

tr <- mlr3::resample(tsk("iris"), graph, rsmp("holdout"))

tr$learners

Question 1: How can I get three different results instead?

Question 2: How can I benchmark these three results within the pipeline after unbranching?

Question 3: What if I want to add multiple learners within each branch? I'd like some of the learners to be inserted with fixed hyperparameters, while for others I'd like to have their hyperparameters tuned with AutoTuner within each branch. Then, I'd like to benchmark them within each branch and select the 'best' one from each branch. Finally, I'd like to benchmark the three best learners to end up with the single best one.

Many thanks.


Solution

  • I think that I've found the answer to what I'm looking for. In brief, what I'd like to do is:

    Create a graph pipeline with multiple learners. I'd like some of the learners to be inserted with fixed hyperparameters, while for others I'd like to have their hyperparameters tuned. Then, I'd like to benchmark them and select the 'best' one. I'd also like the benchmarking of learners to happen under different class balancing strategies, namely, do nothing, up-sample and down-sample. The optimal parameter settings for the up/down-sampling (e.g. ratio) would also be determined during tuning.

    Two examples follow: one that almost does what I want, and one that does exactly what I want.

    Example 1: Build a pipe that includes all learners, that is, learners with fixed hyperparameters, as well as learners whose hyperparameters require tuning

    As will be shown, it seems like a bad idea to mix both kinds of learners (i.e. with fixed and with tunable hyperparameters), because when the pipe is tuned, the tuner settles on the already-tuned learners and effectively ignores the ones with tunable hyperparameters.

    ####################################################################################
    # Build Machine Learning pipeline that:
    # 1. Imputes missing values (optional).
    # 2. Tunes and benchmarks a range of learners.
    # 3. Handles imbalanced data in different ways.
    # 4. Identifies optimal learner for the task at hand.
    
    # Abbreviations
    # 1. td: Tuned. Learner already tuned with optimal hyperparameters, as found empirically by Probst et al. (2019). See http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
    # 2. tn: Tuner. Optimal hyperparameters for the learner to be determined within the Tuner.
    # 3. raw: Raw dataset, i.e. class imbalances are not treated in any way.
    # 4. up: Data upsampling to balance class imbalances.
    # 5. down: Data downsampling to balance class imbalances.
    
    # References
    # Probst et al. (2019). http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
    ####################################################################################
    
    library(dplyr)  # dplyr and tibble are needed for the stratified train/test split below
    library(tibble) # (select, group_by, sample_frac, rownames_to_column, deframe)
    
    task <- tsk('sonar')
    
    # Indices for splitting data into training and test sets
    train.idx <- task$data() %>%
      select(Class) %>%
      rownames_to_column %>%
      group_by(Class) %>%
      sample_frac(2 / 3) %>% # Stratified sample to maintain proportions between classes.
      ungroup %>%
      select(rowname) %>%
      deframe %>%
      as.numeric
    test.idx <- setdiff(seq_len(task$nrow), train.idx)
    
    # Define training and test sets in task format
    task_train <- task$clone()$filter(train.idx)
    task_test  <- task$clone()$filter(test.idx)
    
    # Define class balancing strategies
    class_counts <- table(task_train$truth())
    upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
      class_counts[class_counts == min(class_counts)]
    downsample_ratio <- 1 / upsample_ratio
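    
    # Optional sanity check: inspect the class counts and the derived up/down ratios
    # before they are wired into the class-balancing PipeOps below.
    print(class_counts)
    print(c(up = unname(upsample_ratio), down = unname(downsample_ratio)))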
    
    # 1. Enrich minority class by factor 'ratio'
    po_over <- po("classbalancing", id = "up", adjust = "minor", 
                  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)
    
    # 2. Reduce majority class by factor '1/ratio'
    po_under <- po("classbalancing", id = "down", adjust = "major", 
                   reference = "major", shuffle = FALSE, ratio = downsample_ratio)
    
    # 3. No class balancing
    po_raw <- po("nop", id = "raw") # Pipe operator for 'do nothing' ('nop'), i.e. don't up/down-balance the classes.
    
    # We will be using an XGBoost learner throughout with different hyperparameter settings.
    
    # Define XGBoost learner with the optimal hyperparameters of Probst et al.
    # The learner will be added to the pipeline later on, both with and without class balancing.
    xgb_td <- lrn("classif.xgboost", predict_type = 'prob')
    xgb_td$param_set$values <- list(
      booster = "gbtree", 
      nrounds = 2563, 
      max_depth = 11, 
      min_child_weight = 1.75, 
      subsample = 0.873, 
      eta = 0.052,
      colsample_bytree = 0.713,
      colsample_bylevel = 0.638,
      lambda = 0.101,
      alpha = 0.894
    )
    
    xgb_td_raw <- GraphLearner$new(
      po_raw %>>%
        po('learner', xgb_td, id = 'xgb_td'),
      predict_type = 'prob'
    )
    
    xgb_tn_raw <- GraphLearner$new(
      po_raw %>>%
        po('learner', lrn("classif.xgboost",
                          predict_type = 'prob'), id = 'xgb_tn'),
      predict_type = 'prob'
    )
    
    xgb_td_up <- GraphLearner$new(
      po_over %>>%
        po('learner', xgb_td, id = 'xgb_td'),
      predict_type = 'prob'
    )
    
    xgb_tn_up <- GraphLearner$new(
      po_over %>>%
        po('learner', lrn("classif.xgboost",
                          predict_type = 'prob'), id = 'xgb_tn'),
      predict_type = 'prob'
    )
    
    xgb_td_down <- GraphLearner$new(
      po_under %>>%
        po('learner', xgb_td, id = 'xgb_td'),
      predict_type = 'prob'
    )
    
    xgb_tn_down <- GraphLearner$new(
      po_under %>>%
        po('learner', lrn("classif.xgboost",
                          predict_type = 'prob'), id = 'xgb_tn'),
      predict_type = 'prob'
    )
    
    learners_all <- list(
      xgb_td_raw,
      xgb_tn_raw,
      xgb_td_up,
      xgb_tn_up,
      xgb_td_down,
      xgb_tn_down
    )
    names(learners_all) <- sapply(learners_all, function(x) x$id)
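    
    # The list names come from each GraphLearner's auto-generated id (e.g. something like "raw.xgb_td");
    # they become the branch choices of po("branch") below.
    print(names(learners_all))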
    
    # Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
    # Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
    # Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
    # Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
    # Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
    graph <- 
      #po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
      po("branch", names(learners_all)) %>>%
      gunion(unname(learners_all)) %>>%
      po("unbranch")
    
    graph$plot() # Plot pipeline
    
    pipe <- GraphLearner$new(graph) # Convert pipeline to learner
    pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.
    
    ps_table <- as.data.table(pipe$param_set)
    View(ps_table[, 1:4])
    
    # Set hyperparameter ranges for the tunable learners
    ps_xgboost <- ps_table$id %>%
      lapply(
        function(x) {
          if (grepl('_tn', x)) {
            if (grepl('.booster', x)) {
              ParamFct$new(x, levels = "gbtree")
            } else if (grepl('.nrounds', x)) {
              ParamInt$new(x, lower = 100, upper = 110)
            } else if (grepl('.max_depth', x)) {
              ParamInt$new(x, lower = 3, upper = 10)
            } else if (grepl('.min_child_weight', x)) {
              ParamDbl$new(x, lower = 0, upper = 10)
            } else if (grepl('.subsample', x)) {
              ParamDbl$new(x, lower = 0, upper = 1)
            } else if (grepl('.eta', x)) {
              ParamDbl$new(x, lower = 0.1, upper = 0.6)
            } else if (grepl('.colsample_bytree', x)) {
              ParamDbl$new(x, lower = 0.5, upper = 1)
            } else if (grepl('.gamma', x)) {
              ParamDbl$new(x, lower = 0, upper = 5)
            }
          }
        }
      )
    ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
    ps_xgboost <- ParamSet$new(ps_xgboost)
    
    # Set parameter ranges for the class-balancing strategies
    ps_class_balancing <- ps_table$id %>%
      lapply(
        function(x) {
          if (all(grepl('up.', x), grepl('.ratio', x))) {
            ParamDbl$new(x, lower = 1, upper = upsample_ratio)
          } else if (all(grepl('down.', x), grepl('.ratio', x))) {
            ParamDbl$new(x, lower = downsample_ratio, upper = 1)
          }
        }
      )
    ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
    ps_class_balancing <- ParamSet$new(ps_class_balancing)
    
    # Define parameter set
    param_set <- ParamSetCollection$new(list(
      ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
      ps_xgboost, 
      ps_class_balancing
    ))
    
    # Add dependencies. For instance, an mtry value could only be set if the pipe were configured to use a Random Forest (ranger).
    # In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_td.xgb_tn.booster" and branch "raw.xgb_td".
    # See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
    param_set$ids()[-1] %>%
      lapply(
        function(x) {
          aux <- names(learners_all) %>%
            sapply(
              function(y) {
                grepl(y, x)
              }
            )
          aux <- names(aux[aux])
          param_set$add_dep(x, "branch.selection", 
                            CondEqual$new(aux))
        }
      )
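    
    # Optional: inspect the registered dependencies; each tunable parameter should now
    # depend on its own branch via branch.selection.
    param_set$deps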
    
    # Set up tuning instance
    instance <- TuningInstance$new(
      task = task_train,
      learner = pipe,
      resampling = rsmp('cv', folds = 2),
      measures = msr("classif.bbrier"),
      #measures = prc_micro,
      param_set = param_set,
      terminator = term("evals", n_evals = 3))
    tuner <- TunerRandomSearch$new()
    
    # Tune pipe learner to find best-performing branch
    tuner$tune(instance)
    
    instance$result
    instance$archive() 
    instance$archive(unnest = "tune_x") # Unnest the tuner search space values
    
    pipe$param_set$values <- instance$result$params
    pipe$train(task_train)
    
    pred <- pipe$predict(task_test)
    pred$confusion
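    
    # Optionally score the holdout predictions with the same measure used during tuning.
    pred$score(msr("classif.bbrier"))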
    

    Note that the tuner effectively ignores the tunable learners and settles on the already-tuned ones. This can be confirmed by inspecting instance$result: the only parameters tuned for the tunable learners are the class-balancing parameters, which are not learner hyperparameters at all.
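    
    One quick way to confirm this, assuming the result structure of the mlr3tuning version used above, is to look up the branch of the winning configuration:
    
    instance$result$params$branch.selection # Branch of the winning configuration (one of the 'td' branches, per the note above)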

    Example 2: Build a pipe that includes tunable learners only, find the 'best' one, and then benchmark it against the learners with fixed hyperparameters at a second stage.

    Step 1: Build pipe for tunable learners

    learners_all <- list(
      #xgb_td_raw,
      xgb_tn_raw,
      #xgb_td_up,
      xgb_tn_up,
      #xgb_td_down,
      xgb_tn_down
    )
    names(learners_all) <- sapply(learners_all, function(x) x$id)
    
    # Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
    # Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
    # Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
    # Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
    # Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
    graph <- 
      #po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
      po("branch", names(learners_all)) %>>%
      gunion(unname(learners_all)) %>>%
      po("unbranch")
    
    graph$plot() # Plot pipeline
    
    pipe <- GraphLearner$new(graph) # Convert pipeline to learner
    pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.
    
    ps_table <- as.data.table(pipe$param_set)
    View(ps_table[, 1:4])
    
    ps_xgboost <- ps_table$id %>%
      lapply(
        function(x) {
          if (grepl('_tn', x)) {
            if (grepl('.booster', x)) {
              ParamFct$new(x, levels = "gbtree")
            } else if (grepl('.nrounds', x)) {
              ParamInt$new(x, lower = 100, upper = 110)
            } else if (grepl('.max_depth', x)) {
              ParamInt$new(x, lower = 3, upper = 10)
            } else if (grepl('.min_child_weight', x)) {
              ParamDbl$new(x, lower = 0, upper = 10)
            } else if (grepl('.subsample', x)) {
              ParamDbl$new(x, lower = 0, upper = 1)
            } else if (grepl('.eta', x)) {
              ParamDbl$new(x, lower = 0.1, upper = 0.6)
            } else if (grepl('.colsample_bytree', x)) {
              ParamDbl$new(x, lower = 0.5, upper = 1)
            } else if (grepl('.gamma', x)) {
              ParamDbl$new(x, lower = 0, upper = 5)
            }
          }
        }
      )
    ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
    ps_xgboost <- ParamSet$new(ps_xgboost)
    
    ps_class_balancing <- ps_table$id %>%
      lapply(
        function(x) {
          if (all(grepl('up.', x), grepl('.ratio', x))) {
            ParamDbl$new(x, lower = 1, upper = upsample_ratio)
          } else if (all(grepl('down.', x), grepl('.ratio', x))) {
            ParamDbl$new(x, lower = downsample_ratio, upper = 1)
          }
        }
      )
    ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
    ps_class_balancing <- ParamSet$new(ps_class_balancing)
    
    param_set <- ParamSetCollection$new(list(
      ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
      ps_xgboost, 
      ps_class_balancing
    ))
    
    # Add dependencies. For instance, an mtry value could only be set if the pipe were configured to use a Random Forest (ranger).
    # In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_td.xgb_tn.booster" and branch "raw.xgb_td".
    # See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
    param_set$ids()[-1] %>%
      lapply(
        function(x) {
          aux <- names(learners_all) %>%
            sapply(
              function(y) {
                grepl(y, x)
              }
            )
          aux <- names(aux[aux])
          param_set$add_dep(x, "branch.selection", 
                            CondEqual$new(aux))
        }
      )
    
    # Set up tuning instance
    instance <- TuningInstance$new(
      task = task_train,
      learner = pipe,
      resampling = rsmp('cv', folds = 2),
      measures = msr("classif.bbrier"),
      #measures = prc_micro,
      param_set = param_set,
      terminator = term("evals", n_evals = 3))
    tuner <- TunerRandomSearch$new()
    
    # Tune pipe learner to find best-performing branch
    tuner$tune(instance)
    
    instance$result
    instance$archive() 
    instance$archive(unnest = "tune_x") # Unnest the tuner search space values
    
    pipe$param_set$values <- instance$result$params
    pipe$train(task_train)
    
    pred <- pipe$predict(task_test)
    pred$confusion
    

    Note that instance$result now returns tuned values for the learners' hyperparameters as well, and not just for the class-balancing parameters.

    Step 2: Benchmark 'best' tunable learner (now tuned) and the learners that have fixed hyperparameters

    # Define the resampling and instantiate it so that the same splits are always used
    
    resampling <- rsmp("cv", folds = 2)
    
    set.seed(123)
    resampling$instantiate(task_train)
    
    bmr <- benchmark(
      design = benchmark_grid(
        task_train,
        learners = list(pipe, xgb_td_raw, xgb_td_up, xgb_td_down), # Tuned pipe vs. the fixed-hyperparameter (td) learners
        resampling
      ),
      store_models = TRUE # Only needed if you want to inspect the models
    )
    
    bmr$aggregate(msr("classif.bbrier"))
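    
    # For a quick visual comparison, the benchmark result can also be plotted with mlr3viz
    # (the available autoplot options may differ slightly between package versions).
    library(mlr3viz)
    autoplot(bmr, measure = msr("classif.bbrier"))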
    

    A few issues to consider

    1. I should probably have created a second, separate pipe for the learners that have fixed hyperparameters, so that at least their class-balancing parameters get tuned. The two pipes (tunable and fixed hyperparameters) would then be benchmarked against each other with benchmark(); see the sketch after this list.
    2. I should probably have used the same resampling strategy from beginning to end, i.e. instantiated the resampling strategy right before tuning the first pipe, so that the same splits are also used for the second pipe and in the final benchmark.
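    
    A minimal sketch of the first point, reusing the objects from Example 1; untested as written and only meant to illustrate the idea:
    
    # Second pipe: fixed-hyperparameter (td) learners only. Tune just branch.selection and
    # the up/down 'ratio' parameters, then benchmark it against 'pipe' from Step 1.
    learners_td <- list(xgb_td_raw, xgb_td_up, xgb_td_down)
    names(learners_td) <- sapply(learners_td, function(x) x$id)
    
    graph_td <- po("branch", names(learners_td)) %>>%
      gunion(unname(learners_td)) %>>%
      po("unbranch")
    pipe_td <- GraphLearner$new(graph_td)
    pipe_td$predict_type <- 'prob'
    
    # Build the search space from branch.selection plus the class-balancing ratios
    # (same pattern as ps_class_balancing above, with the matching branch dependencies),
    # tune pipe_td as in Step 1, and then compare the two pipes on the same splits:
    # benchmark(benchmark_grid(task_train, list(pipe, pipe_td), resampling))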

    Comments/validation more than welcome.

    (special thanks to missuse for the constructive comments)