Dummy code categorical / ordinal variables in the tidyverse (R)


Let's say I have a tibble.

library(tidyverse) 
tib <- as_tibble(list(record = c(1:10), 
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)), 
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib

    # A tibble: 10 x 3
   record gender like_product
    <int> <fctr>       <fctr>
 1      1      F            2
 2      2      M            1
 3      3      M            2
 4      4      F            3
 5      5      F            4
 6      6      M            2
 7      7      F            4
 8      8      M            4
 9      9      F            4
10     10      M            5

I would like to dummy code my data with 1s and 0s so that the data looks more or less like this.

# A tibble: 10 x 8
   record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
    <int>    <dbl>    <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
 1      1        0        1              0              0              1              0              0
 2      2        0        1              0              0              0              0              0
 3      3        0        1              0              1              0              0              0
 4      4        0        1              1              0              0              0              0
 5      5        1        0              0              0              0              0              0
 6      6        0        1              0              0              0              0              0
 7      7        0        1              0              0              0              0              0
 8      8        0        1              0              1              0              0              0
 9      9        1        0              0              0              0              0              0
10     10        1        0              0              0              0              0              1

In my workflow I know the range of variables to dummy code (i.e. gender:like_product), but I don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.

I know of several ways of doing this, but none of them fits neatly within the tidyverse. I know I could use mutate...

tib %>%
     mutate(gender_M = ifelse(gender == "M", 1, 0), 
            gender_F = ifelse(gender == "F", 1, 0), 
            like_product_1 = ifelse(like_product == 1, 1, 0), 
            like_product_2 = ifelse(like_product == 2, 1, 0), 
            like_product_3 = ifelse(like_product == 3, 1, 0), 
            like_product_4 = ifelse(like_product == 4, 1, 0), 
            like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
     select(-gender, -like_product)

But this breaks my workflow rule: I would have to specify every dummy-coded output by hand.

I've done this in the past with model.matrix, from the stats package.

model.matrix(~ gender + like_product, tib) 

Easy and straightforward, but I want a solution in the tidyverse. EDIT: The reason is that I still have to specify every variable by hand; being able to use select helpers to specify something like gender:like_product would be much preferred.
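For comparison, here is a rough sketch of how model.matrix could be pointed at a column range chosen with select(), so that neither the variables nor their levels are typed out. The seed, the explicit factor levels, and the names to_code, full_contrasts and dummies are assumptions made for illustration only. Note that model.matrix names its columns genderF, like_product1, and so on, rather than using the underscore style shown above.

```r
library(dplyr)
library(tibble)

set.seed(42)  # hypothetical seed, only so the sketch is reproducible
tib <- tibble(
  record = 1:10,
  gender = factor(sample(c("M", "F"), 10, replace = TRUE), levels = c("F", "M")),
  like_product = factor(sample(1:5, 10, replace = TRUE), levels = 1:5)
)

# Pick the columns to encode with a select helper (here a column range)
to_code <- tib %>% select(gender:like_product)

# contrasts = FALSE keeps every level instead of dropping a reference level
full_contrasts <- lapply(to_code, contrasts, contrasts = FALSE)

# "- 1" removes the intercept column from the design matrix
mm <- model.matrix(~ . - 1, data = to_code, contrasts.arg = full_contrasts)

# Put the untouched columns back in front of the dummy columns
dummies <- bind_cols(select(tib, -(gender:like_product)), as_tibble(mm))
dummies
```

This still leaves the tidyverse for the encoding step itself, so it is closer to a workaround than to the pipeline-friendly solution asked for.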

I think the solution is in purrr

library(purrr)
dummy_code <- function(x) {
     lvls <- levels(x)
     sapply(lvls, function(y) as.integer(x == y)) %>% as_tibble
} 

tib %>%
     map_at(c("gender", "like_product"), dummy_code)

$record
 [1]  1  2  3  4  5  6  7  8  9 10

$gender
# A tibble: 10 x 2
       F     M
   <int> <int>
 1     1     0
 2     0     1
 3     0     1
 4     1     0
 5     1     0
 6     0     1
 7     1     0
 8     0     1
 9     1     0
10     0     1

$like_product
# A tibble: 10 x 5
     `1`   `2`   `3`   `4`   `5`
   <int> <int> <int> <int> <int>
 1     0     1     0     0     0
 2     1     0     0     0     0
 3     0     1     0     0     0
 4     0     0     1     0     0
 5     0     0     0     1     0
 6     0     1     0     0     0
 7     0     0     0     1     0
 8     0     0     0     1     0
 9     0     0     0     1     0
10     0     0     0     0     1

This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
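As a sketch of one way the pieces might be stitched back together: purrr's imap() passes each column's name along with its values, so the dummy columns can be prefixed and then bound with bind_cols(). The two-argument version of dummy_code, the seed, the explicit factor levels, and the names to_code, dummies and result are assumptions for illustration, not part of the attempt above.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(42)  # hypothetical seed, only so the sketch is reproducible
tib <- tibble(
  record = 1:10,
  gender = factor(sample(c("M", "F"), 10, replace = TRUE), levels = c("F", "M")),
  like_product = factor(sample(1:5, 10, replace = TRUE), levels = 1:5)
)

# One 0/1 integer column per factor level, named <variable>_<level>
dummy_code <- function(x, name) {
  lvls <- levels(x)
  out <- map(lvls, function(lvl) as.integer(x == lvl))
  names(out) <- paste(name, lvls, sep = "_")
  as_tibble(out)
}

# Choose the columns with a select helper instead of listing them
to_code <- tib %>% select(gender:like_product)

# imap() calls dummy_code(column, column_name) for every selected column
dummies <- imap(to_code, dummy_code)

# Drop the list names so bind_cols() spreads the columns side by side
result <- bind_cols(select(tib, -(gender:like_product)), bind_cols(unname(dummies)))
result
```

Only the range gender:like_product is written out; the variables inside it and their levels are discovered from the data.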

Any better ideas? Thanks!!


Solution

  • An alternative to model.matrix is the recipes package. At the time of writing it is still a work in progress and is not part of the tidyverse, although it might be included at some point.

    I will leave it up to you to read up on recipes, but in step_dummy() you can use selectors from the tidyselect package (installed with recipes), such as starts_with(), just as you would in dplyr. I created a small example to show the steps; the code is below.

    Whether this is handier than model.matrix is up to you, as was already pointed out in the comments: bake() uses model.matrix to create the dummies. The difference is mostly in the column names and, of course, in the internal checks performed in the underlying code of each step.

    library(recipes)
    library(tibble)
    
    tib <- as_tibble(list(record = c(1:10), 
                          gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)), 
                          like_product = as.factor(sample(1:5, 10, replace = TRUE))))
    
    # Older recipes releases named the bake() argument `newdata`
    dum <- tib %>% 
      recipe(~ .) %>% 
      step_dummy(gender, like_product) %>% 
      prep(training = tib) %>% 
      bake(new_data = tib)
    
    dum
    
    # A tibble: 10 x 6
       record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
        <int>    <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
     1      1       1.              1.              0.              0.              0.
     2      2       1.              1.              0.              0.              0.
     3      3       1.              1.              0.              0.              0.
     4      4       0.              0.              1.              0.              0.
     5      5       0.              0.              0.              0.              0.
     6      6       0.              1.              0.              0.              0.
     7      7       0.              1.              0.              0.              0.
     8      8       0.              0.              0.              1.              0.
     9      9       0.              0.              0.              0.              1.
    10     10       1.              0.              0.              0.              0.
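To connect this back to the question's two requirements (no hand-listed variables, and a column for every level), here is a minimal sketch assuming a recipes version that provides the all_nominal() selector and the one_hot argument of step_dummy(). The seed, the explicit factor levels, and the name dum_all are illustrative assumptions. Since the step selects columns through tidyselect, a range such as gender:like_product should also be accepted, though that is worth checking against your installed version.

```r
library(recipes)
library(tibble)

set.seed(42)  # hypothetical seed, only so the sketch is reproducible
tib <- tibble(
  record = 1:10,
  gender = factor(sample(c("M", "F"), 10, replace = TRUE), levels = c("F", "M")),
  like_product = factor(sample(1:5, 10, replace = TRUE), levels = 1:5)
)

dum_all <- recipe(~ ., data = tib) %>%
  # all_nominal() picks every factor column, so none is named by hand;
  # one_hot = TRUE keeps a column for every level (no reference level dropped)
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep(training = tib) %>%
  bake(new_data = tib)

names(dum_all)
```

With one_hot = TRUE the result has one indicator per level, which is the shape asked for in the question; the default (one_hot = FALSE) drops the first level of each factor, as in the output above.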