I have a simple recipe to train a model. My categorical variables change over time, and sometimes I want a numeric variable to be treated as categorical (e.g. postal code), so I define a vector containing them before the recipe. (The example is simplified for the sake of argument; the real list is much longer.)
The recipe worked fine, but when I then tuned my model (3 folds) an error was raised.
Is there a proper way of passing a list of variables to a recipe without crashing the model?
library(tidymodels)
library(magrittr)  # provides the %<>% assignment pipe

mtcars1 <- mtcars
mtcars1 %<>% dplyr::mutate(new1 = sample.int(200, 32, replace = TRUE),
                           new2 = sample.int(100, 32, replace = TRUE),
                           new3 = sample.int(50, 32, replace = TRUE))
my_categorical <- c("new1", "new2", "new3")
mtcars_split <- initial_split(mtcars1, strata = drat)
train <- training(mtcars_split)
test <- testing(mtcars_split)
recipe <-
  recipes::recipe(drat ~ ., data = train) %>%
  recipes::step_mutate_at(all_of(my_categorical), fn = ~as.character(.)) %>%
  recipes::step_string2factor(all_of(my_categorical))
cv_folds <-
  rsample::vfold_cv(train,
                    v = 3,
                    strata = drat)
xgboost_model <-
  parsnip::boost_tree(
    mode = "classification",
    trees = 100,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune(),
    mtry = tune()
  ) %>%
  set_engine("xgboost")
xgboost_workflow <-
  workflows::workflow() %>%
  add_recipe(recipe) %>%
  add_model(xgboost_model)
xgboost_grid <-
  parameters(xgboost_model) %>%
  finalize(select(training(mtcars_split), -drat)) %>%
  grid_max_entropy(size = 100)
model_metrics <- yardstick::metric_set(gain_capture, roc_auc)
xgboost_tuned <-
  tune::tune_grid(
    object = xgboost_workflow,
    resamples = cv_folds,
    grid = xgboost_grid,
    metrics = model_metrics,
    control = tune::control_grid(save_pred = TRUE, save_workflow = TRUE)
  )
You definitely were passing the vector of variables correctly to the recipe -- no problem there!
You were running into other problems with your model fitting. An xgboost model requires all predictors to be numeric, so if you convert something like zip code to a factor, you then need to use step_dummy(). If you have something of high cardinality like zip codes, you will probably need to handle new levels or unknown levels as well.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
library(magrittr)  # provides the %<>% assignment pipe
mtcars1 <- mtcars
mtcars1 %<>% dplyr::mutate(new1 = sample.int(10, 32, replace = TRUE),
                           new2 = sample.int(5, 32, replace = TRUE))
my_categorical <- c("new1", "new2")
mtcars_split <- initial_split(mtcars1)
train <- training(mtcars_split)
test <- testing(mtcars_split)
cv_folds <- vfold_cv(train, v = 3)
rec <-
  recipe(drat ~ ., data = train) %>%
  step_mutate_at(all_of(my_categorical), fn = ~as.character(.)) %>%
  step_string2factor(all_of(my_categorical)) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())
xgboost_model <-
  boost_tree(
    mode = "regression",  # drat is numeric, so this is a regression problem
    trees = tune()
  ) %>%
  set_engine("xgboost")
xgboost_workflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(xgboost_model)
tune_grid(
  object = xgboost_workflow,
  resamples = cv_folds,
  grid = 5
)
#> # Tuning results
#> # 3-fold cross-validation
#> # A tibble: 3 x 4
#>   splits         id    .metrics          .notes
#>   <list>         <chr> <list>            <list>
#> 1 <split [16/8]> Fold1 <tibble [10 × 5]> <tibble [0 × 1]>
#> 2 <split [16/8]> Fold2 <tibble [10 × 5]> <tibble [0 × 1]>
#> 3 <split [16/8]> Fold3 <tibble [10 × 5]> <tibble [0 × 1]>
Created on 2021-06-25 by the reprex package (v2.0.0)
I had to change a few other things in your example to get this to run, like using "regression" since drat is numeric, etc. I recommend checking out the reprex package so you can run an example like this in a fresh R session and get help more effectively.
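If you want to confirm that a recipe hands xgboost only numeric columns before you start tuning, one option is to prep() and bake() it on its own and inspect the result. The following is a minimal sketch rather than part of the reprex above: the data frame toy_train and its columns y, x and zip are made up purely for illustration, and the steps simply mirror the recipe shown earlier.

```r
library(tidymodels)

# Hypothetical toy data: `zip` is stored as a number but should be
# treated as a categorical variable.
toy_train <- data.frame(
  y   = c(1.2, 3.4, 2.2, 4.1, 3.3, 2.8),
  x   = c(10, 20, 15, 30, 25, 18),
  zip = c(111, 222, 111, 333, 222, 333)
)
my_categorical <- c("zip")

toy_rec <-
  recipe(y ~ ., data = toy_train) %>%
  step_mutate_at(all_of(my_categorical), fn = ~ as.character(.)) %>%
  step_string2factor(all_of(my_categorical)) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Estimate the recipe on the training data, then return the processed
# training set (new_data = NULL).
prepped <- prep(toy_rec, training = toy_train)
baked <- bake(prepped, new_data = NULL)

# Inspect the resulting columns and check that every one is numeric.
names(baked)
all(vapply(baked, is.numeric, logical(1)))
```

If the last line returns FALSE, some nominal column is still reaching the model undummied, which is the situation described above for xgboost.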