Search code examples
rdplyrtidyrdummy-variabler-recipes

How to translate `recipes::step_dummy()` to `dplyr`/`tidyr` code?


I'm trying to figure out how step_dummy() from recipes package wrangles the data. Although there's a reference page for this function, I'm still unable to wrap my head around how to do it using "regular" tidyverse tools I know. Here's some code based on recipes and rsample packages. I would like to achieve the same data output but just using dplyr/tidyr tools.

I chose diamonds dataset from ggplot2 for this demonstration.

library(rsample)
library(recipes)

my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>% 
  prep() %>%
  bake(new_data = NULL) # equivalent to `juice()`. It means to get the training data (`d_training`) after the steps in the recipe were applied to it.

d_training_dummied_using_recipe
#> # A tibble: 5,394 x 6
#>    carat price  cut_1  cut_2     cut_3  cut_4
#>    <dbl> <int>  <dbl>  <dbl>     <dbl>  <dbl>
#>  1  0.5   1678 -0.316 -0.267  6.32e- 1 -0.478
#>  2  0.7   2608 -0.316 -0.267  6.32e- 1 -0.478
#>  3  1.7   9996  0.316 -0.267 -6.32e- 1 -0.478
#>  4  0.73  1824  0.316 -0.267 -6.32e- 1 -0.478
#>  5  0.4    988  0.632  0.535  3.16e- 1  0.120
#>  6  1.04  4240  0.316 -0.267 -6.32e- 1 -0.478
#>  7  0.9   3950  0     -0.535 -4.10e-16  0.717
#>  8  0.4   1116  0     -0.535 -4.10e-16  0.717
#>  9  1.34 10070  0.632  0.535  3.16e- 1  0.120
#> 10  0.6    806  0.316 -0.267 -6.32e- 1 -0.478
#> # ... with 5,384 more rows

My question is how, given d_training, we could arrive an output identical to d_training_dummied_using_recipe, by using dplyr or tidyr (and potentially forcats) functions? I've seen posts such as this one, but they don't seem to fit the current case.


EDIT


Clearly, step_dummy() operates only on cut column, and this is because we specified all_nominal(). Indeed, cut is the only nominal variable in d_training. I thought that cut_* columns correspond to the levels of cut, but then I ran:

levels(d_training$cut)
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"  

which shows 6 levels, whereas there are only 4 cut_* columns. So this is one limitation to understanding what's going on. In addition, how those values in cut_* are generated?


EDIT 2


I've come across the most relevant vignette How are categorical predictors handled in recipes? and it discusses the topic directly.

A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.

The default approach is to create dummy variables using the “reference cell” parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns

Regarding the number of levels vs. number of cut_* columns, the vignette says explicitly:

Note that the column names do not reference a specific level of the [...] variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn’t make sense.

But ultimately there's no example how to carry the same operation with regular tools (that are not within recipes context). So my original question remains unsolved.


Solution

  • You can look at the source code for step_dummy(); I'm not sure that I would call it a black box per se. Notice that during bake(), it uses model.matrix() from base R.

    library(rsample)
    library(recipes)
    #> Loading required package: dplyr
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    
    data(diamonds, package = "ggplot2")
    
    my_diamonds <- diamonds[, c("carat", "cut", "price")]
    init_split  <- initial_split(my_diamonds, prop = .1)
    d_training  <- training(init_split)
    
    d_training_dummied_using_recipe <-
      recipe(formula = price ~ ., data = d_training) %>%
      step_dummy(all_nominal()) %>% 
      prep() %>%
      bake(new_data = NULL) 
    
    d_training_dummied_using_recipe
    #> # A tibble: 5,394 × 6
    #>    carat price     cut_1  cut_2     cut_3  cut_4
    #>    <dbl> <int>     <dbl>  <dbl>     <dbl>  <dbl>
    #>  1  0.31   544 -3.16e- 1 -0.267  6.32e- 1 -0.478
    #>  2  0.72  3294  6.32e- 1  0.535  3.16e- 1  0.120
    #>  3  0.7   2257 -1.48e-18 -0.535 -3.89e-16  0.717
    #>  4  0.5   1446  6.32e- 1  0.535  3.16e- 1  0.120
    #>  5  0.31   772  6.32e- 1  0.535  3.16e- 1  0.120
    #>  6  1.01  3733  3.16e- 1 -0.267 -6.32e- 1 -0.478
    #>  7  0.31   942  6.32e- 1  0.535  3.16e- 1  0.120
    #>  8  0.43   903 -3.16e- 1 -0.267  6.32e- 1 -0.478
    #>  9  1.21  4391  3.16e- 1 -0.267 -6.32e- 1 -0.478
    #> 10  1.37  5370  3.16e- 1 -0.267 -6.32e- 1 -0.478
    #> # … with 5,384 more rows
    
    
    model.matrix(price ~ .,
                 data = d_training) %>%
      as_tibble()
    #> # A tibble: 5,394 × 6
    #>    `(Intercept)` carat     cut.L  cut.Q     cut.C `cut^4`
    #>            <dbl> <dbl>     <dbl>  <dbl>     <dbl>   <dbl>
    #>  1             1  0.31 -3.16e- 1 -0.267  6.32e- 1  -0.478
    #>  2             1  0.72  6.32e- 1  0.535  3.16e- 1   0.120
    #>  3             1  0.7  -1.48e-18 -0.535 -3.89e-16   0.717
    #>  4             1  0.5   6.32e- 1  0.535  3.16e- 1   0.120
    #>  5             1  0.31  6.32e- 1  0.535  3.16e- 1   0.120
    #>  6             1  1.01  3.16e- 1 -0.267 -6.32e- 1  -0.478
    #>  7             1  0.31  6.32e- 1  0.535  3.16e- 1   0.120
    #>  8             1  0.43 -3.16e- 1 -0.267  6.32e- 1  -0.478
    #>  9             1  1.21  3.16e- 1 -0.267 -6.32e- 1  -0.478
    #> 10             1  1.37  3.16e- 1 -0.267 -6.32e- 1  -0.478
    #> # … with 5,384 more rows
    

    Created on 2021-12-30 by the reprex package (v2.0.1)

    The recipes implementation of creating these indicator variables sets up some protection and convenience around learning from training data and applying to new or testing data, as well as more standard naming, etc. This may be a particularly confusing example because cut is an ordered factor.