
tidymodels recipes: can I use step_dummy() to one-hot encode the categorical variables *except* booleans, which only need 1 dummy?


If a categorical variable has more than 2 values (like marital status = single/married/widowed/separated/divorced), then I need to create N dummies, one for each of the N possible levels. This is done using step_dummy(one_hot = TRUE).

However, if the category is binary (pokemon_fan = "yes"/"no") then I only need to create a single dummy called "pokemon_fan_yes". This is done using step_dummy(one_hot = FALSE).

Is it possible for step_dummy to count the number of levels and proceed differently depending on that number?

thanks.


Solution

  • There is no automatic way to do this within recipes itself, but I think you can write a small helper function that checks the number of levels and sets one_hot for you, something like this:

    library(recipes)
    #> Loading required package: dplyr
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    
    data(crickets, package = "modeldata")
    
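    # TRUE when the factor has more than `num` levels, i.e. when it should
    # be one-hot encoded rather than reduced to a single dummy
    levels_more_than <- function(vec, num = 2) {
      nlevels(vec) > num
    }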
    
    recipe(~ ., data = crickets) %>%
      step_dummy(species, one_hot = !! levels_more_than(crickets$species)) %>%
      prep() %>%
      bake(new_data = NULL)
    #> # A tibble: 31 × 3
    #>     temp  rate species_O..niveus
    #>    <dbl> <dbl>             <dbl>
    #>  1  20.8  67.9                 0
    #>  2  20.8  65.1                 0
    #>  3  24    77.3                 0
    #>  4  24    78.7                 0
    #>  5  24    79.4                 0
    #>  6  24    80.4                 0
    #>  7  26.2  85.8                 0
    #>  8  26.2  86.6                 0
    #>  9  26.2  87.5                 0
    #> 10  26.2  89.1                 0
    #> # … with 21 more rows
    
    recipe(~ ., data = iris) %>%
      step_dummy(Species, one_hot = !! levels_more_than(iris$Species)) %>%
      prep() %>%
      bake(new_data = NULL)
    #> # A tibble: 150 × 7
    #>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
    #>           <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
    #>  1          5.1         3.5          1.4         0.2              1
    #>  2          4.9         3            1.4         0.2              1
    #>  3          4.7         3.2          1.3         0.2              1
    #>  4          4.6         3.1          1.5         0.2              1
    #>  5          5           3.6          1.4         0.2              1
    #>  6          5.4         3.9          1.7         0.4              1
    #>  7          4.6         3.4          1.4         0.3              1
    #>  8          5           3.4          1.5         0.2              1
    #>  9          4.4         2.9          1.4         0.2              1
    #> 10          4.9         3.1          1.5         0.1              1
    #> # … with 140 more rows, and 2 more variables: Species_versicolor <dbl>,
    #> #   Species_virginica <dbl>
    

    Created on 2022-02-23 by the reprex package (v2.0.1)

    Here are some tips for using not-quite-standard selectors in recipes.
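
If you have several categorical columns, calling the helper once per column gets repetitive. One possible variation (a minimal sketch, not part of the original answer) is to split the factor columns by level count up front and pass each group to its own step_dummy() call with all_of(). The `survey` data frame and the names `binary_cols` and `multi_cols` are made up for illustration, and the sketch assumes both groups are non-empty.

```r
library(recipes)

# Hypothetical toy data (not from the original post): one binary factor,
# one five-level factor, and one numeric column.
survey <- data.frame(
  pokemon_fan    = factor(c("yes", "no", "yes", "no", "yes")),
  marital_status = factor(c("single", "married", "widowed",
                            "separated", "divorced")),
  age            = c(23, 35, 61, 44, 52)
)

# Count the levels of every factor column, then split the column names
# into binary factors and factors with more than two levels.
is_fct      <- vapply(survey, is.factor, logical(1))
n_lvls      <- vapply(survey[is_fct], nlevels, integer(1))
binary_cols <- names(n_lvls)[n_lvls == 2]
multi_cols  <- names(n_lvls)[n_lvls > 2]

baked <- recipe(~ ., data = survey) %>%
  # binary factors: a single dummy each
  step_dummy(all_of(binary_cols), one_hot = FALSE) %>%
  # factors with more than two levels: one dummy per level
  step_dummy(all_of(multi_cols), one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

The level counts here are taken from the data before the recipe is built, just as in the helper above, so they reflect whatever data frame you compute them on rather than anything learned during prep().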