Is there a reason the recipe code snippet for the xgboost classifier has one_hot = TRUE? This creates "n" dummy variables instead of "n-1". I usually set it to FALSE but just want to make sure I'm not missing something.
Code -
library(dplyr)

data <- mtcars %>%
  as_tibble() %>%
  mutate(cyl = as.factor(cyl))

usemodels::use_xgboost(mpg ~ cyl, data = data)
Output -
xgboost_recipe <-
  recipe(formula = mpg ~ cyl, data = data) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors())

xgboost_spec <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
    loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgboost_workflow <-
  workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)

xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
The idea there is that xgboost, as a tree-based model, can handle indicator columns for all "n" levels (unlike a linear model, where the full set is collinear with the intercept), and it can actually require more splits to fit well if you leave one category out. Read more about this here.
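To make the "n" versus "n-1" difference concrete, here is a minimal sketch that bakes the same recipe both ways and compares the resulting dummy columns. It assumes the recipes package is installed; the helper name `dummy_names` is made up for illustration and is not part of usemodels or recipes.

```r
library(recipes)

# Same setup as in the question: cyl becomes a factor with three levels
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Hypothetical helper: names of the predictor columns step_dummy() creates
dummy_names <- function(one_hot) {
  rec <- recipe(mpg ~ cyl, data = cars)
  rec <- step_dummy(rec, all_nominal(), -all_outcomes(), one_hot = one_hot)
  baked <- bake(prep(rec), new_data = NULL)
  setdiff(names(baked), "mpg")
}

reference_coded <- dummy_names(one_hot = FALSE)  # n - 1 indicator columns
one_hot_coded   <- dummy_names(one_hot = TRUE)   # n indicator columns

print(reference_coded)
print(one_hot_coded)
```

With one_hot = FALSE the first level (4 cylinders) has no column of its own and is encoded as "all indicators are zero", so a tree needs a split on each remaining indicator to isolate it. With one_hot = TRUE a single split on that level's own indicator is enough.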
You don't see the same for the ranger random forest because it can handle factors natively.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
cars <- as_tibble(mtcars) %>%
  mutate(cyl = as.factor(cyl))

usemodels::use_ranger(mpg ~ cyl, data = cars)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
#> ranger_recipe <-
#>   recipe(formula = mpg ~ cyl, data = cars)
#>
#> ranger_spec <-
#>   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
#>   set_mode("regression") %>%
#>   set_engine("ranger")
#>
#> ranger_workflow <-
#>   workflow() %>%
#>   add_recipe(ranger_recipe) %>%
#>   add_model(ranger_spec)
#>
#> set.seed(54153)
#> ranger_tune <-
#>   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
