Tags: r, optimization, parallel-processing, xgboost, tidymodels

Why is my xgboost model tuning abysmally slow?


I am trying to tune xgboost parameters in R and it takes more than a day to complete on a 32 GB RAM machine with 8 cores / 16 logical processors. Could someone look at the reprex and suggest improvements, or point out anything that is nonsensical or wrong? The tune_grid() portion is what runs really slowly.

Note: the reprex does not take more than a day to run. It is smaller than my real data--I have 230k observations and 115k "groups" (think group = patient ID, where each patient has multiple observations), there are more predictors, and the nominal predictors have more classes. At the start I make some arbitrary modifications so the stackoverflow data looks slightly more like the real data.

library(tidymodels)
library(modeldata)
library(janitor)
library(parallel)

data(stackoverflow)

set.seed(123)
ids <- 1:nrow(stackoverflow)
groups <- sample(c(state.name, state.abb),
                 size = nrow(stackoverflow),
                 replace = TRUE)
a_fctr <- sample(c('fctr1', 'fctr2', 'fctr3'),
                 size = nrow(stackoverflow),
                 replace = TRUE)

stackoverflow_mod <- stackoverflow %>%
  clean_names() %>%
  bind_cols(tibble(ids, groups, a_fctr)) %>%
  rename(id_num = ids,
         group_nm = groups,
         fctr_nm = a_fctr) %>%
  # if any member of the group is a graphic designer, mark the entire group
  group_by(group_nm) %>%
  mutate(strata_graphic = ifelse(any(graphic_designer == 1), 'Graphic designer', 'None')) %>%
  ungroup()
  

set.seed(123)
ini_split <- group_initial_split(stackoverflow_mod,
                                 prop = 0.8,
                                 group = group_nm,
                                 strata = strata_graphic)
train_split <- training(ini_split)
test_split <- testing(ini_split)

# columns to keep in the data but not use as predictors
exclude <- c('id_num',
             'group_nm',
             'strata_graphic')

recipe_1 <- recipe(remote ~ ., data = train_split) %>%
  update_role(any_of(exclude), new_role = 'extra') %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

xgb_tm <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
  ) %>%
  set_engine('xgboost') %>%
  set_mode('classification')

xgb_workflow <- workflow() %>%
  add_recipe(recipe_1) %>%
  add_model(xgb_tm)

set.seed(123)
xgb_grid <- grid_latin_hypercube(
  trees(range = c(500, 1500)),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(c(0.4, 0.8)),
  # upper bound for mtry depends on the number of columns after one-hot encoding
  finalize(mtry(), prep(recipe_1) %>% bake(new_data = train_split)),
  learn_rate(),
  size = 50
)

set.seed(123)
folds_group <- group_vfold_cv(data = train_split,
                              balance = 'groups',
                              v = 10,
                              repeats = 3, 
                              group = group_nm,
                              strata = strata_graphic)

gc()
# leave a couple of logical cores free for the OS / main R session
clust <- makePSOCKcluster(detectCores() - 2)
doParallel::registerDoParallel(clust)

xgb_res <- xgb_workflow %>%
  tune_grid(
    resamples = folds_group,
    grid = xgb_grid,
    control = control_grid(save_pred = TRUE, parallel_over = 'everything'),
    metrics = metric_set(roc_auc, pr_auc)
  )

stopCluster(clust)
foreach::registerDoSEQ()

Is this tuning time normal for the size of the dataset? Am I implementing parallelization wrong? Is my tuning grid set up dumb? Something else?


Solution

  • With this setup you are fitting 50 different models (50 hyperparameter combinations from the grid) 30 times each (10-fold cross-validation repeated 3 times). That comes to 1,500 fitted models. Even if each model fit took only 1 second, a sequential run would take 25 minutes, and 1 second per fit is very optimistic for xgboost on a dataset of this size.

    A trick I like to use is to first run the code without a parallel backend, using control = control_grid(verbose = TRUE), and manually time how long each model takes to fit. After a couple of models you can stop the process and do the math to estimate the total runtime (a sketch of this is shown after this list).

    While doing this, you should also take a look at how much memory the R session uses when running the code single-threaded. Depending on the size of the data, you may not have enough memory to fit multiple models at the same time, in which case it can actually be faster to fit everything single-threaded (see the second sketch below for a rough check).
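
    As a sketch of that timing approach (reusing xgb_workflow, xgb_grid, train_split, and group_nm from the question; the 3-candidate grid, the 2-fold resample, and the dry_run_* names are arbitrary dry-run choices, not anything from the original post):

    dry_run_grid  <- xgb_grid[1:3, ]
    dry_run_folds <- group_vfold_cv(train_split, v = 2, group = group_nm)

    # no parallel backend registered on purpose: verbose = TRUE prints a
    # message as each model finishes, so you can watch the pace per fit
    dry_run_time <- system.time(
      xgb_workflow %>%
        tune_grid(
          resamples = dry_run_folds,
          grid = dry_run_grid,
          control = control_grid(verbose = TRUE)
        )
    )

    # 3 candidates x 2 folds = 6 fits in the dry run
    seconds_per_fit <- dry_run_time[["elapsed"]] / 6

    # extrapolate to the full job: 50 candidates x 10 folds x 3 repeats = 1500 fits
    seconds_per_fit * 1500 / 3600  # estimated sequential hours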
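
    For the memory side, a very rough heuristic (an assumption, not an exact accounting) is that each PSOCK worker holds its own copy of the data it is sent, so the footprint scales with the number of workers:

    # size of the training data in GB (base R; the xgboost model matrices and
    # fitted boosters add to this, so treat the result as a lower bound)
    data_gb <- as.numeric(object.size(train_split)) / 1024^3
    n_workers <- parallel::detectCores() - 2

    # rough lower bound on the memory the workers alone will use
    data_gb * n_workers

    If that estimate gets anywhere near the machine's 32 GB, dropping to fewer workers (or running sequentially) can end up faster than pushing the machine into swapping.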