This may be something simple I'm missing. I'm new to recipes.
Below is some code that tries to apply recipes::step_cut to two variables: xpo, then xpr. The prep after the second step_cut errors out, complaining that it found a factor variable xpr, but xpr doesn't look like a factor. I expected no errors.
Help?
Reproducible test case (just run the code):
library(recipes)

test_data <- data.frame(
  xpo = c(99, NA, NA, 100, 99, NA),
  xpr = c(90, NA, NA, 98, 86, NA),
  target = c(0, 0, 0, 0, 0, 0)
)

recipe_obj <- recipe(as.formula("target ~ ."), data = test_data) %>%
  step_naomit(all_predictors())

var_name <- 'xpo'
cutpoints <- c(9, 97, 98, 99, 101)
recipe_obj <- recipe_obj %>%
  step_cut(
    all_of(var_name),
    breaks = cutpoints,
    id = paste0("cut_", var_name)
  )
prep1 <- prep(recipe_obj, training_data=test_data)
juice1 <- juice(prep1)
bake1 <- bake(prep1, new_data=test_data)

var_name <- 'xpr'
cutpoints <- c(1, 75, 86, 99, 228)
recipe_obj2 <- recipe_obj %>%
  step_cut(
    all_of(var_name),
    breaks = cutpoints,
    id = paste0("cut_", var_name)
  )
prep2 <- prep(recipe_obj2, training_data=test_data)
The error (running the last line to set prep2):
Error in `step_cut()`:
Caused by error in `prep()`:
✖ All columns selected for the step should be double or integer.
• 1 factor variable found: `xpr`
The steps of recipe_obj2 (xpo, then xpr):
> tidy(recipe_obj2)
# A tibble: 3 × 6
  number operation type   trained skip  id
   <int> <chr>     <chr>  <lgl>   <lgl> <chr>
1      1 step      naomit FALSE   TRUE  naomit_xyvBB
2      2 step      cut    FALSE   FALSE cut_xpo
3      3 step      cut    FALSE   FALSE cut_xpr
A look at juice1, the result of step1 (xpo is a factor, xpr is not):
> glimpse(juice1)
Rows: 3
Columns: 3
$ xpo    <fct> "(98,99]", "(99,101]", "(98,99]"
$ xpr    <dbl> 90, 98, 86
$ target <dbl> 0, 0, 0
Versions:
> R.version.string
[1] "R version 4.3.0 (2023-04-21)"
> packageVersion("recipes")
[1] ‘1.1.0’
This happens because you reuse the var_name variable for both steps.
To get this to work, use {{var_name}} instead, as in:
recipe_obj <- recipe_obj %>%
  step_cut(
    all_of({{var_name}}),
    breaks = cutpoints,
    id = paste0("cut_", var_name)
  )
The curly-curly operator evaluates var_name right away and stores the resulting value in the step, rather than a reference to the variable.
When you just call all_of(var_name) in the first recipe, it doesn't pull in the variable's value; it just records "look for var_name when prepping." By the time you prep recipe_obj2, var_name is 'xpr', so both steps select xpr. This means that your final recipe looks like this:
── Operations
• Removing rows with NA values in: all_predictors()
• Cut numeric for: all_of(var_name)
• Cut numeric for: all_of(var_name)
but what it tries to do is:
── Operations
• Removing rows with NA values in: xpo and xpr
• Cut numeric for: xpr
• Cut numeric for: xpr
which fails because the first step_cut() has already turned xpr into a factor, so the second step_cut() complains that it was not given a numeric column.
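The delayed lookup can be seen in isolation, without recipes. This is only a sketch using rlang directly, on the assumption that recipes stores its selectors as quosures in much the same way; the names lazy_1, injected_1 and so on are made up for the illustration, and !! is rlang's injection operator, which at top level should behave like the {{var_name}} used above.

```r
library(rlang)

# Mimic two steps that each store a selector for later evaluation.
var_name <- "xpo"
lazy_1     <- quo(var_name)    # stores the symbol `var_name`
injected_1 <- quo(!!var_name)  # stores the value "xpo"

var_name <- "xpr"              # the variable is reused for the second step
lazy_2     <- quo(var_name)
injected_2 <- quo(!!var_name)

# Evaluate everything later, as prep() would, after var_name has changed.
lazy_result     <- c(eval_tidy(lazy_1), eval_tidy(lazy_2))
injected_result <- c(eval_tidy(injected_1), eval_tidy(injected_2))

print(lazy_result)      # both entries are "xpr"
print(injected_result)  # "xpo", then "xpr"
```

The lazy versions both see the final value of var_name, which mirrors the two "Cut numeric for: xpr" operations above.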
Giving each step its own variable also works, because each selector then refers to a name whose value never changes:

library(recipes)

test_data <- data.frame(
  xpo = c(99, NA, NA, 100, 99, NA),
  xpr = c(90, NA, NA, 98, 86, NA),
  target = c(0, 0, 0, 0, 0, 0)
)

recipe_obj <- recipe(as.formula("target ~ ."), data = test_data) %>%
  step_naomit(all_predictors())

var_name_1 <- 'xpo'
cutpoints <- c(9, 97, 98, 99, 101)
recipe_obj <- recipe_obj %>%
  step_cut(
    all_of(var_name_1),
    breaks = cutpoints,
    id = paste0("cut_", var_name_1)
  )
# prep()'s data argument is named `training`, not `training_data`
prep1 <- prep(recipe_obj, training = test_data)
juice1 <- juice(prep1)
bake1 <- bake(prep1, new_data = test_data)

var_name_2 <- 'xpr'
cutpoints <- c(1, 75, 86, 99, 228)
recipe_obj2 <- recipe_obj %>%
  step_cut(
    all_of(var_name_2),
    breaks = cutpoints,
    id = paste0("cut_", var_name_2)
  )
prep2 <- prep(recipe_obj2, training = test_data)
prep2
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 2
#>
#> ── Training information
#> Training data contained 6 data points and 3 incomplete rows.
#>
#> ── Operations
#> • Removing rows with NA values in: xpo and xpr | Trained
#> • Cut numeric for: xpo | Trained
#> • Cut numeric for: xpr | Trained
bake(prep2, NULL)
#> # A tibble: 3 × 3
#>   xpo      xpr     target
#>   <fct>    <fct>    <dbl>
#> 1 (98,99]  (86,99]      0
#> 2 (99,101] (86,99]      0
#> 3 (98,99]  (75,86]      0
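If there are more variables to cut, the steps can also be added in a loop. The following is a minimal sketch, not necessarily the pattern recipes recommends: cut_specs is a hypothetical lookup table introduced for the example, and the loop relies on rlang's !! operator to inject the current value of var_name into each selector, so reusing the loop variable should be safe.

```r
library(recipes)

test_data <- data.frame(
  xpo = c(99, NA, NA, 100, 99, NA),
  xpr = c(90, NA, NA, 98, 86, NA),
  target = c(0, 0, 0, 0, 0, 0)
)

# Hypothetical lookup table: one vector of breaks per variable to cut.
cut_specs <- list(
  xpo = c(9, 97, 98, 99, 101),
  xpr = c(1, 75, 86, 99, 228)
)

rec <- recipe(target ~ ., data = test_data) %>%
  step_naomit(all_predictors())

for (var_name in names(cut_specs)) {
  rec <- rec %>%
    step_cut(
      all_of(!!var_name),              # inject the value now, not the name
      breaks = cut_specs[[var_name]],
      id = paste0("cut_", var_name)
    )
}

prepped <- prep(rec, training = test_data)
result <- bake(prepped, new_data = NULL)  # processed training data
```

Here breaks and id are ordinary arguments and are evaluated when step_cut() is called; only the selector is stored for later, which is why it is the only place that needs the injection.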