Why is step_cut from R's recipes package complaining about a factor in my data frame?


This may be something simple I'm missing. I'm new to recipes.

Below is some code that tries to apply recipes::step_cut to two variables: xpo, then xpr. The prep after the second step_cut errors out, complaining that it found a factor variable xpr, but it doesn't look like a factor. I expected no errors.

Help?

Reproducible test case (just run the code):

library(recipes)

test_data <- data.frame(
  xpo = c(99, NA, NA, 100, 99, NA),
  xpr = c(90, NA, NA, 98, 86, NA),
  target = c(0, 0, 0, 0, 0, 0)
)

recipe_obj <- recipe(as.formula("target ~ ."), data = test_data) %>%
  step_naomit(all_predictors())

var_name <- 'xpo'
cutpoints <- c(9, 97, 98, 99, 101)
recipe_obj <- recipe_obj %>%
  step_cut(
    all_of(var_name),
    breaks = cutpoints,
    id = paste0("cut_", var_name)
  )

prep1 <- prep(recipe_obj, training = test_data)
juice1 <- juice(prep1)
bake1 <- bake(prep1, new_data=test_data)

var_name <- 'xpr'
cutpoints <- c(1, 75, 86, 99, 228)
recipe_obj2 <- recipe_obj %>%
  step_cut(
    all_of(var_name),
    breaks = cutpoints,
    id = paste0("cut_", var_name)
  )
prep2 <- prep(recipe_obj2, training = test_data)

The error (running the last line to set prep2):

Error in `step_cut()`:
Caused by error in `prep()`:
✖ All columns selected for the step should be double or integer.
• 1 factor variable found: `xpr`

The steps of recipe_obj2 (xpo, then xpr):

> tidy(recipe_obj2)
# A tibble: 3 × 6
  number operation type   trained skip  id          
   <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
1      1 step      naomit FALSE   TRUE  naomit_xyvBB
2      2 step      cut    FALSE   FALSE cut_xpo     
3      3 step      cut    FALSE   FALSE cut_xpr    

A look at juice1, the result of step1 (xpo is a factor, xpr is not):

> glimpse(juice1)
Rows: 3
Columns: 3
$ xpo    <fct> "(98,99]", "(99,101]", "(98,99]"
$ xpr    <dbl> 90, 98, 86
$ target <dbl> 0, 0, 0

Versions:

> R.version.string
[1] "R version 4.3.0 (2023-04-21)"
> packageVersion("recipes")
[1] ‘1.1.0’

Solution

    Summary

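    This happens because both steps select their column through the same var_name variable. The selector all_of(var_name) is not evaluated when the step is added; it is evaluated when prep() runs, and by then var_name is 'xpr' for both steps.

    To get this to work, use {{var_name}} instead, as in: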

    recipe_obj <- recipe_obj %>%
      step_cut(
        all_of({{var_name}}),
        breaks = cutpoints,
        id = paste0("cut_", var_name)
      )
    

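    The curly-curly operator evaluates var_name right away, when the step is added, and stores the resulting value in the selector, so reassigning var_name later no longer changes which column the step selects.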
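If you want to keep a single reusable variable, for example in a loop, here is a minimal sketch of the same idea. It assumes rlang's `!!` injection operator, which as far as I know does the same job as `{{` when used at top level like this. The names `cut_specs`, `rec`, `prepped` and `result` are made up for this example; the data and breaks come from the question.

```r
library(recipes)

test_data <- data.frame(
  xpo = c(99, NA, NA, 100, 99, NA),
  xpr = c(90, NA, NA, 98, 86, NA),
  target = c(0, 0, 0, 0, 0, 0)
)

# Breaks from the question, keyed by column name (hypothetical helper list).
cut_specs <- list(
  xpo = c(9, 97, 98, 99, 101),
  xpr = c(1, 75, 86, 99, 228)
)

rec <- recipe(target ~ ., data = test_data) %>%
  step_naomit(all_predictors())

# One loop variable is reused for every step. `!!` injects its current value
# into the selector, so no step depends on what var_name holds at prep() time.
for (var_name in names(cut_specs)) {
  rec <- rec %>%
    step_cut(
      all_of(!!var_name),
      breaks = cut_specs[[var_name]],
      id = paste0("cut_", var_name)
    )
}

prepped <- prep(rec)
result <- bake(prepped, new_data = NULL)
```

Note that `breaks` needs no special treatment: it is an ordinary argument and is evaluated when `step_cut()` is called, which is why reusing `cutpoints` in the question did not cause trouble.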

    Details

    When you call all_of(var_name) in the first step_cut(), it doesn't pull in the value of var_name. It just records "look for var_name when prepping." This means that your final recipe looks like this:

    ── Operations 
    • Removing rows with NA values in:
      all_predictors()
    • Cut numeric for: all_of(var_name)
    • Cut numeric for: all_of(var_name)
    

    but by the time prep() runs, var_name is 'xpr', so what it tries to do is:

    ── Operations 
    • Removing rows with NA values in:
      xpo and xpr
    • Cut numeric for: xpr
    • Cut numeric for: xpr
    

    which fails: the first step_cut() has already converted xpr to a factor, so the second step_cut() finds a factor where it requires a double or integer column.
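The delayed lookup can be reproduced without recipes at all. This is a small sketch using rlang quosures, which is my understanding of how recipes stores step selectors; `lazy` and `eager` are illustrative names only.

```r
library(rlang)

var_name <- "xpo"

# Stores the expression `var_name` together with its environment.
lazy <- quo(var_name)

# Injects the current value of var_name into the quosure.
eager <- quo(!!var_name)

# Reassign, as the question does before adding the second step_cut().
var_name <- "xpr"

eval_tidy(lazy)   # looks up var_name now, so it returns "xpr"
eval_tidy(eager)  # value was fixed at capture time, so it returns "xpo"
```

The first quosure behaves like `all_of(var_name)` in the question: it sees whatever `var_name` holds at evaluation time, not what it held when the step was written.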

    # Alternative fix: give each step its own variable, so the selectors never share a name.
    library(recipes)
    
    test_data <- data.frame(
      xpo = c(99, NA, NA, 100, 99, NA),
      xpr = c(90, NA, NA, 98, 86, NA),
      target = c(0, 0, 0, 0, 0, 0)
    )
    
    recipe_obj <- recipe(as.formula("target ~ ."), data = test_data) %>%
      step_naomit(all_predictors())
    
    var_name_1 <- 'xpo'
    cutpoints <- c(9, 97, 98, 99, 101)
    recipe_obj <- recipe_obj %>%
      step_cut(
        all_of(var_name_1),
        breaks = cutpoints,
        id = paste0("cut_", var_name_1)
      )
    
    prep1 <- prep(recipe_obj, training = test_data)
    juice1 <- juice(prep1)
    bake1 <- bake(prep1, new_data=test_data)
    
    var_name_2 <- 'xpr'
    cutpoints <- c(1, 75, 86, 99, 228)
    recipe_obj2 <- recipe_obj %>%
      step_cut(
        all_of(var_name_2),
        breaks = cutpoints,
        id = paste0("cut_", var_name_2)
      )
    prep2 <- prep(recipe_obj2, training = test_data)
    prep2
    #> 
    #> ── Recipe ──────────────────────────────────────────────────────────────────────
    #> 
    #> ── Inputs
    #> Number of variables by role
    #> outcome:   1
    #> predictor: 2
    #> 
    #> ── Training information
    #> Training data contained 6 data points and 3 incomplete rows.
    #> 
    #> ── Operations
    #> • Removing rows with NA values in: xpo and xpr | Trained
    #> • Cut numeric for: xpo | Trained
    #> • Cut numeric for: xpr | Trained
    
    bake(prep2, NULL)
    #> # A tibble: 3 × 3
    #>   xpo      xpr     target
    #>   <fct>    <fct>    <dbl>
    #> 1 (98,99]  (86,99]      0
    #> 2 (99,101] (86,99]      0
    #> 3 (98,99]  (75,86]      0