I'm trying to figure out how step_dummy() from the recipes package wrangles the data. Although there's a reference page for this function, I'm still unable to wrap my head around how to do the same thing using the "regular" tidyverse tools I know. Here's some code based on the recipes and rsample packages. I would like to achieve the same data output using only dplyr/tidyr tools. I chose the diamonds dataset from ggplot2 for this demonstration.
library(rsample)
library(recipes)
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split <- initial_split(my_diamonds, prop = .1)
d_training <- training(init_split)
d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL) # equivalent to `juice()`: returns the training data (`d_training`) after the recipe's steps were applied to it.
d_training_dummied_using_recipe
#> # A tibble: 5,394 x 6
#> carat price cut_1 cut_2 cut_3 cut_4
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 0.5 1678 -0.316 -0.267 6.32e- 1 -0.478
#> 2 0.7 2608 -0.316 -0.267 6.32e- 1 -0.478
#> 3 1.7 9996 0.316 -0.267 -6.32e- 1 -0.478
#> 4 0.73 1824 0.316 -0.267 -6.32e- 1 -0.478
#> 5 0.4 988 0.632 0.535 3.16e- 1 0.120
#> 6 1.04 4240 0.316 -0.267 -6.32e- 1 -0.478
#> 7 0.9 3950 0 -0.535 -4.10e-16 0.717
#> 8 0.4 1116 0 -0.535 -4.10e-16 0.717
#> 9 1.34 10070 0.632 0.535 3.16e- 1 0.120
#> 10 0.6 806 0.316 -0.267 -6.32e- 1 -0.478
#> # ... with 5,384 more rows
My question is: given d_training, how could we arrive at an output identical to d_training_dummied_using_recipe, using dplyr or tidyr (and potentially forcats) functions? I've seen posts such as this one, but they don't seem to fit the current case.
EDIT
Clearly, step_dummy() operates only on the cut column, because we specified all_nominal(). Indeed, cut is the only nominal variable in d_training. I thought that the cut_* columns correspond to the levels of cut, but then I ran:
levels(d_training$cut)
#> [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
which shows 5 levels, whereas there are only 4 cut_* columns. So this is one obstacle to understanding what's going on.
In addition, how are the values in the cut_* columns generated?
EDIT 2
I've come across the most relevant vignette, How are categorical predictors handled in recipes?, which discusses the topic directly:
A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.
The default approach is to create dummy variables using the “reference cell” parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns
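The "reference cell" parameterization the vignette describes can be sketched with only base R; the toy factor x below is a hypothetical example, not part of the diamonds data:

```r
# An unordered factor with C = 3 levels yields C - 1 = 2 indicator columns;
# the first level ("a") is the reference and is encoded as all zeros.
x <- factor(c("a", "b", "c"))
model.matrix(~ x)
#>   (Intercept) xb xc
#> 1           1  0  0
#> 2           1  1  0
#> 3           1  0  1
#> (contrast/assign attribute lines omitted)
```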
Regarding the number of levels vs. the number of cut_* columns, the vignette says explicitly:
Note that the column names do not reference a specific level of the [...] variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn’t make sense.
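This is because cut is an ordered factor, for which R's default contrast function is contr.poly(): each of the 4 columns scores all 5 levels with an orthogonal polynomial (linear, quadratic, cubic, quartic). Rounded, these are exactly the values filling the cut_* columns above:

```r
# 5 ordered levels -> 4 polynomial contrast columns, one per degree.
round(contr.poly(5), 3)
#>          .L     .Q     .C     ^4
#> [1,] -0.632  0.535 -0.316  0.120
#> [2,] -0.316 -0.267  0.632 -0.478
#> [3,]  0.000 -0.535  0.000  0.717
#> [4,]  0.316 -0.267 -0.632 -0.478
#> [5,]  0.632  0.535  0.316  0.120
```

Each row is one level of cut (Fair through Ideal), so e.g. every "Good" diamond gets the row (-0.316, -0.267, 0.632, -0.478).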
But ultimately there's no example of how to carry out the same operation with regular tools (outside the recipes context). So my original question remains unsolved.
You can look at the source code for step_dummy(); I'm not sure that I would call it a black box per se. Notice that during bake(), it uses model.matrix() from base R.
library(rsample)
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
data(diamonds, package = "ggplot2")
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split <- initial_split(my_diamonds, prop = .1)
d_training <- training(init_split)
d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)
d_training_dummied_using_recipe
#> # A tibble: 5,394 × 6
#> carat price cut_1 cut_2 cut_3 cut_4
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 0.31 544 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 2 0.72 3294 6.32e- 1 0.535 3.16e- 1 0.120
#> 3 0.7 2257 -1.48e-18 -0.535 -3.89e-16 0.717
#> 4 0.5 1446 6.32e- 1 0.535 3.16e- 1 0.120
#> 5 0.31 772 6.32e- 1 0.535 3.16e- 1 0.120
#> 6 1.01 3733 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 7 0.31 942 6.32e- 1 0.535 3.16e- 1 0.120
#> 8 0.43 903 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 9 1.21 4391 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 10 1.37 5370 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> # … with 5,384 more rows
model.matrix(price ~ ., data = d_training) %>%
  as_tibble()
#> # A tibble: 5,394 × 6
#> `(Intercept)` carat cut.L cut.Q cut.C `cut^4`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.31 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 2 1 0.72 6.32e- 1 0.535 3.16e- 1 0.120
#> 3 1 0.7 -1.48e-18 -0.535 -3.89e-16 0.717
#> 4 1 0.5 6.32e- 1 0.535 3.16e- 1 0.120
#> 5 1 0.31 6.32e- 1 0.535 3.16e- 1 0.120
#> 6 1 1.01 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 7 1 0.31 6.32e- 1 0.535 3.16e- 1 0.120
#> 8 1 0.43 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 9 1 1.21 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 10 1 1.37 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> # … with 5,384 more rows
Created on 2021-12-30 by the reprex package (v2.0.1)
The recipes implementation of creating these indicator variables adds some protection and convenience around learning from training data and applying to new or testing data, as well as more standard naming, etc. This may be a particularly confusing example because cut is an ordered factor.
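To close the loop on the original question, here is a minimal sketch with regular tools. It assumes a data frame like d_training from the reprex above (a stand-in built from the full diamonds data is used so the snippet runs on its own); the renaming to cut_1 … cut_4 is done by hand to mimic recipes' naming scheme:

```r
library(dplyr)

# Stand-in for d_training from the reprex above; any tibble whose `cut`
# column is an ordered factor behaves the same way.
d_training <- ggplot2::diamonds[, c("carat", "cut", "price")]

d_training_dummied_by_hand <-
  model.matrix(~ cut, data = d_training) %>%        # polynomial contrasts, since cut is ordered
  as_tibble() %>%
  select(-`(Intercept)`) %>%                        # drop the intercept column
  rename_with(~ paste0("cut_", seq_along(.x))) %>%  # cut.L, cut.Q, ... -> cut_1, cut_2, ...
  bind_cols(select(d_training, carat, price), .)    # reattach the untouched columns
```

Up to floating-point noise and row sampling, the cut_* columns produced this way match d_training_dummied_using_recipe above.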