Tags: r, machine-learning, workflow, tidymodels

tidymodels workflow errors when trying to use workflow_map()


Thanks for your help in advance! I'm new to tidymodels (and modeling in general) and am having a hard time identifying what's going wrong to troubleshoot my workflow set up.

I'm running four different models to predict baseball win percentages based on a historical dataset. They are a linear model, elastic net model, random forest model, and XGBoost model. I know all the models work (I have tested them individually), but I am trying to use a workflow to test, cross-validate, and select the best models.

I have two different types of recipes: a basic recipe that includes some preprocessing steps (selecting variables, step_zv, step_nzv, step_interact, step_corr, and step_impute_bag) for the random forest and XGBoost models, and a normalized recipe that adds a normalization step for the linear and elastic net models.

After setting up my workflows and grids, when I try to run workflow_map(), I get two errors:

  1. "Error in summary.connection(connection) : invalid connection"
  2. "2 arguments have been tagged for tuning in these components: model_spec. Please use one of the tuning functions (e.g. 'tune_grid()') to optimize them"

My questions:

  1. What does the first error indicate?
  2. As for the second, where should I be adding/incorporating tune_grid() into the workflow?

--

For reference, here is some of the relevant code:

Some initial set up

# Split data
team_split <- initial_split(mlb_final)

# Extract training and testing data
team_train <- training(team_split)
team_test <- testing(team_split)

# Resampling strategy
team_rs <- vfold_cv(team_train)

Model specification

# Random forest model (mtry is tagged too, so the spec matches rf_grid below)
mlb_forest <- rand_forest(mtry = tune(), min_n = tune()) %>% 
  set_engine("ranger",
             importance = "permutation") %>% 
  set_mode("regression")

# Linear model
mlb_linear <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# XGBoost
mlb_xgb <- boost_tree(
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Elastic Net
mlb_elastic <- linear_reg(
  penalty = tune(),    
  mixture = tune()     
) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

I've set up my workflows like this:

linear_workflow <- workflow() |> 
  add_model(mlb_linear) |> 
  add_recipe(normalized_recipe)
  
elastic_workflow <- workflow() |> 
  add_model(mlb_elastic) |> 
  add_recipe(normalized_recipe)
  
rf_workflow <- workflow() |> 
  add_model(mlb_forest) |> 
  add_recipe(basic_recipe)

xgb_workflow <- workflow() |> 
  add_model(mlb_xgb) |> 
  add_recipe(basic_recipe)

And my grids like this:

grid_ctrl <- control_grid(
  save_pred = TRUE,
  parallel_over = NULL,
  save_workflow = TRUE,
  verbose = TRUE
)

rf_grid <- grid_regular(
  min_n(range = c(5, 50)),  # Min number of observations per leaf (tuning parameter)
  mtry(range = c(2, 10)),   # Number of variables to randomly sample at each split
  levels = 5                # Levels of grid search
)

xgb_grid <- grid_regular(
  trees(range = c(100, 500)),      
  min_n(range = c(5, 15)),        
  tree_depth(range = c(3, 6)),    
  learn_rate(range = log10(c(0.05, 0.1))),  # dials expects this range on the log10 scale
  levels = 5
)

elastic_grid <- grid_regular(
  penalty(range = c(-2, 1), trans = log10_trans()),  
  mixture(range = c(0, 1)),                          
  levels = 5
)

linear_grid <- 5

I then combined them into normalized and basic workflow sets.

normalized_mlb <- workflow_set(
  preproc = list(normalized = normalized_recipe), 
  models = list(linear = mlb_linear, 
                elastic = mlb_elastic)
  )

basic_mlb <- workflow_set(
  preproc = list(basic = basic_recipe),
  models = list(rf = mlb_forest, 
                xgb = mlb_xgb)
)

And then I tried to use workflow_map() for both the normalized and basic workflow sets:

lm_models <- normalized_mlb |> 
  workflow_map("fit_resamples",
               seed = 100,
               verbose = TRUE,
               resamples = team_rs, 
               control = grid_ctrl)

basic_models <- basic_mlb  |> 
  workflow_map("fit_resamples",
               seed = 100,
               verbose = TRUE,
               resamples = team_rs, 
               control = grid_ctrl)

The workflows are split into normalized and basic workflows because, initially, I was trying to run them together and running into issues. However, I'm still not sure how to address these errors.


Solution

  • I used some simulated data to try to reproduce the results (and could).

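    Some of the workflows have tuning parameters and some don't. workflow_map() has the default argument of fn = "tune_grid" but will fall back to "fit_resamples" if the workflow doesn't have tuning parameters. Forcing "fit_resamples" on a workflow that still has tune() placeholders is what triggers the "arguments have been tagged for tuning" error.

    If you take out "fit_resamples" from your code, it runs.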
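Here is a minimal sketch of that fallback on toy data. It uses mtcars, a plain lm and an rpart tree as stand-ins for your models (those choices are mine, not from your code). fn is left at its default, so the tree gets tuned and the lm is simply resampled.

```r
library(tidymodels)  # attaches parsnip, rsample, tune, workflowsets, yardstick

set.seed(100)
folds <- vfold_cv(mtcars, v = 3)

# One spec with nothing to tune, one with a parameter tagged by tune()
lm_spec <- linear_reg() |>
  set_engine("lm")

tree_spec <- decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

toy_set <- workflow_set(
  preproc = list(basic = mpg ~ .),
  models  = list(lm = lm_spec, tree = tree_spec)
)

# fn is not supplied, so workflow_map() uses "tune_grid" where it can
# and falls back to "fit_resamples" for the untuned lm workflow
toy_res <- workflow_map(
  toy_set,
  resamples = folds,
  grid = 3,
  metrics = metric_set(rmse),
  seed = 100
)

toy_res
```

Both rows of toy_res should come back with results rather than errors, even though only one of the two workflows has anything to tune.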

    I can't reproduce "Error in summary.connection(connection) : invalid connection".

    I assume it is related to parallel processing? If you are working over a remote session, it could be related to a connection problem too.
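If it is the parallel backend, one thing worth trying is resetting it. This is only a guess at the cause: it assumes a foreach cluster (for example from doParallel) was registered earlier in the session and then stopped, so foreach is still pointing at dead worker connections. Nothing in your post confirms that.

```r
# Assumption: a stale foreach/doParallel cluster is still registered
library(foreach)

# Switch foreach back to its sequential backend so nothing tries to
# reuse connections to workers that no longer exist
registerDoSEQ()

# Number of workers foreach will use from here on
getDoParWorkers()
```

If the error goes away after this, register a fresh cluster before calling workflow_map() again instead of reusing the old one.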

    One other thing... we don't have an obvious way of adding custom grids (yet). You can do this though, as long as the options are added before workflow_map() runs:

    basic_models <- basic_mlb  |> 
      option_add(grid = xgb_grid, id = "basic_xgb") |> 
      option_add(grid = rf_grid,  id = "basic_rf") |> 
      workflow_map(seed = 100,         #<- removed "fit_resamples"
                   verbose = TRUE,
                   resamples = team_rs, 
                   control = grid_ctrl)
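To check that a custom grid is really being used, here is a small self-contained sketch on toy data. The tree model, tree_grid, and mtcars are placeholders I made up for illustration. The point is that the grid has to be stored with option_add() before workflow_map() is called, and rank_results() then shows one row per grid candidate.

```r
library(tidymodels)

set.seed(100)
folds <- vfold_cv(mtcars, v = 3)

tree_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

# Hypothetical custom grid: 2 x 2 = 4 candidates
tree_grid <- grid_regular(
  cost_complexity(range = c(-3, -1)),  # range is on the log10 scale
  min_n(range = c(2, 10)),
  levels = 2
)

toy_res <- workflow_set(
  preproc = list(basic = mpg ~ .),
  models  = list(tree = tree_spec)
) |>
  option_add(grid = tree_grid, id = "basic_tree") |>  # stored before tuning
  workflow_map(resamples = folds, metrics = metric_set(rmse), seed = 100)

# One row per candidate, ranked by RMSE
ranked <- rank_results(toy_res, rank_metric = "rmse")
ranked
```

The workflow id passed to option_add() is the preprocessor name and the model name joined by an underscore, which is why your set uses "basic_xgb" and "basic_rf".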