r, parallel-processing, future, mlr3

How to speed up the resampling process with parallelization in mlr3?


I am trying to run the resampling process with parallelization in mlr3, but I find that it is always slower than the sequential plan. Here is my code and result:

# load the packages
library(mlr3)
library(mlr3learners)          # provides the classif.ranger learner
library(future)
library(future.apply)
library(tictoc)

# sequential plan
set.seed(100)
tic()
task_train_cv <- resample(
  task = task_train,           # the training data is about 60000 rows and 29 cols
  learner = lrn("classif.ranger", predict_type = "prob"),
  resampling = rsmp("cv", folds = 5),
  store_models = TRUE)
toc()                          # 207.14 sec elapsed

# parallel plan
plan(multisession)
set.seed(100)
tic()
task_train_cv_par <- resample(
  task = task_train,
  learner = lrn("classif.ranger", predict_type = "prob"),
  resampling = rsmp("cv", folds = 5),
  store_models = TRUE)
toc()                          # 268.99 sec elapsed
plan(sequential)

I have tested this many times, with different numbers of workers in plan() and on different laptops, and the parallel plan is always slower. The same happens with hyperparameter tuning and nested resampling. Yet I can see the sessions working in the background when I check the Task Manager in Windows.

Is there something wrong with my parallelization setting in mlr3? Thanks!


Solution

  • The random forest implementation (ranger) uses multithreading by default in the current CRAN release of mlr3learners (the default will change in the next release). So the "sequential" run already uses several threads: you are comparing two parallel executions, and the second one, via multisession, comes with a slightly larger overhead.
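
To compare like with like, one option is to pin ranger to a single thread and let future distribute the folds. The following is a minimal sketch, not a benchmark: it assumes the built-in "sonar" task as a stand-in for task_train and an arbitrary choice of two workers, and it relies on the learner exposing ranger's num.threads parameter.

```r
library(mlr3)
library(mlr3learners)  # provides the "classif.ranger" learner
library(future)

# Stand-in data (assumption): the built-in "sonar" task replaces task_train
task <- tsk("sonar")

# Pin ranger to one thread so it does not parallelize internally
learner <- lrn("classif.ranger", predict_type = "prob", num.threads = 1)

# Let future distribute the resampling folds across worker sessions
plan(multisession, workers = 2)  # worker count is an arbitrary choice
set.seed(100)
rr <- resample(
  task = task,
  learner = learner,
  resampling = rsmp("cv", folds = 5)
)
plan(sequential)  # restore the default plan

# Aggregated classification error over the five folds
rr$aggregate(msr("classif.ce"))
```

With num.threads = 1 in both runs, the sequential timing becomes a true single-threaded baseline. On a task as small as this one the session start-up cost will likely dominate, so any speed-up should only be expected on larger data such as the 60000-row task in the question.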