
Bulk recoding of variables in the tidyverse (functional / meta-programming)


I want to recode a number of variables in one data.frame with as few function calls as possible. I create a named list that maps each variable name to the recoding I want to apply. Using map and dplyr for this is no problem. For the recoding itself, however, I find recode from the car package much easier to use than dplyr's own recode function. A side question is whether there is a nice way of doing the same thing with dplyr::recode.

As a next step I break the data.frame down into a nested tibble and want to apply specific recodings to each subset. This is where things get complicated: I can no longer do this in a dplyr pipe, and the only thing I get working is a very ugly nested for loop.

Looking for ideas to do this in a nice and clean way.

Let's start with the easy example:

library(carData)
library(dplyr)
library(purrr)
library(tidyr)

# global recode list
recode_ls = list(

  mar = "'not married' = 0;
          'married' = 1",

  wexp = "'no' = 0;
          'yes' = 1"
)

recode_vars <- names(Rossi)[names(Rossi) %in% names(recode_ls)]

Rossi2 <- Rossi # let's save the results under a different name

Rossi2[,recode_vars] <- recode_vars %>% map(~ car::recode(Rossi[[.x]],
                                                          recode_ls[[.x]],
                                                          as.factor = FALSE,
                                                          as.numeric = TRUE))

So far this seems pretty clean to me, apart from the fact that car::recode is much easier to use than dplyr::recode.

Here comes my actual problem. I want to recode the variables (mar and wexp in this simple example) differently in each subset of the nested tibble. In my real data set there are many more variables to recode in each subset, and their names differ between subsets. Does anyone have a good idea how to do this cleanly using a dplyr pipe and map?

    nested_rossi <- as_tibble(Rossi) %>% nest(-race)

    recode_wexp_ls = list(

      no = list(

      mar = "'not married' = 0;
             'married' = 1",

      wexp = "'no' = 0;
              'yes' = 1"
      ),

      yes = list(
        mar = "'not married' = 1;
               'married' = 2",

        wexp = "'no' = 1;
                'yes' = 2"
      )
    )

We could also attach the list to the nested data.frame, but I'm not sure if this would make things more efficient.

nested_rossi$recode = list(

          no = list(

          mar = "'not married' = 0;
                 'married' = 1",

          wexp = "'no' = 0;
                  'yes' = 1"
          ),

          yes = list(
            mar = "'not married' = 1;
                   'married' = 2",

            wexp = "'no' = 1;
                    'yes' = 2"
          )
        )

Solution

  • Thanks for a cool question! This is a great chance to use all the power of metaprogramming.

    First, let's examine dplyr's recode() function. It takes a vector and an arbitrary number of named arguments, and returns the vector with every value that matches an argument name replaced by that argument's value:

    x <- c("a", "b", "c")
    recode(x, a = "Z", c = "X")
    
    #> [1] "Z" "b" "X"
    

    recode's help says that we can use unquote splicing (!!!) to pass a named list into it.

    x_codes <- list(a = "Z", c = "X")
    recode(x, !!!x_codes)
    
    #> [1] "Z" "b" "X"
    

    This ability can be used when mutating a data frame. Suppose we have a subset of the Rossi dataset:

    library(carData)
    library(tidyverse)
    
    rossi <- Rossi %>% 
      as_tibble() %>% 
      select(mar, wexp)
    

    To mutate two variables in a single function call we can use this snippet (note that both named arguments and unquote splicing approaches work well):

    mar_codes <- list(`not married` = 0, married = 1)
    wexp_codes <- list(no = 0, yes = 1)
    
    rossi %>% 
      mutate(
        mar_code = recode(mar, "not married" = 0, "married" = 1),
        wexp_code = recode(wexp, !!!wexp_codes)
      )
    
    #> # A tibble: 432 x 4
    #>    mar         wexp  mar_code wexp_code
    #>    <fct>       <fct>    <dbl>     <dbl>
    #>  1 not married no           0         0
    #>  2 not married no           0         0
    #>  3 not married yes          0         1
    #>  4 married     yes          1         1
    #>  5 not married yes          0         1
    

    So unquote splicing is a good way to pass a list of arguments into a function that uses non-standard evaluation.
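As a quick check of what splicing does, the sketch below compares three ways of calling recode() on the small character vector from the first example: arguments spelled out, the list spliced with !!!, and base R's do.call(). The do.call() form is only an illustrative comparison and is not part of the approach described here.

```r
library(dplyr)

x <- c("a", "b", "c")
x_codes <- list(a = "Z", c = "X")

# 1. Arguments written out by hand
spelled_out <- recode(x, a = "Z", c = "X")

# 2. The same arguments spliced in from a named list
spliced <- recode(x, !!!x_codes)

# 3. Base R equivalent: build the full argument list and call the function
via_do_call <- do.call(recode, c(list(x), x_codes))

# All three calls give the same recoded vector
identical(spelled_out, spliced)
identical(spliced, via_do_call)
```

The spliced form is the one that scales, because the list of codes can be built or stored elsewhere and reused.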

    Now suppose we have a list of lists of codes:

    mapping <- list(mar = mar_codes, wexp = wexp_codes)
    mapping
    
    #> $mar
    #> $mar$`not married`
    #> [1] 0
    
    #> $mar$married
    #> [1] 1
    
    #> $wexp
    #> $wexp$no
    #> [1] 0
    
    #> $wexp$yes
    #> [1] 1
    

    What we need is to transform this list into a list of expressions to place inside mutate():

    expressions <- mapping %>% 
      imap(
        ~ quo(
          recode(!!sym(.y), !!!.x)
        )
      )
    
    expressions
    
    #> $mar
    #> <quosure>
    #> expr: ^recode(mar, not married = 0, married = 1)
    #> env:  0x7fbf374513c0
    
    #> $wexp
    #> <quosure>
    #> expr: ^recode(wexp, no = 0, yes = 1)
    #> env:  0x7fbf37453468
    

    The last step is to pass this list of expressions into mutate() and see what it does:

    mutate(rossi, !!!expressions)
    
    #> # A tibble: 432 x 2
    #>      mar  wexp
    #>    <dbl> <dbl>
    #>  1     0     0
    #>  2     0     0
    #>  3     0     1
    #>  4     1     1
    #>  5     0     1
    

    Now you can extend your lists of variables to recode, handle several lists at once, and so on.
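For comparison, here is a minimal sketch of a different route to the same result that builds no expressions: across() together with cur_column() looks up the codes for each column by name. This assumes dplyr 1.0 or later; the toy data and the helper recode_with() are hypothetical names introduced only for this illustration. The splice is wrapped in a helper so that it stays out of the expression that mutate() captures.

```r
library(dplyr)

# Hypothetical toy data standing in for the mar and wexp columns of Rossi
toy <- tibble(
  mar  = factor(c("not married", "married", "not married")),
  wexp = factor(c("no", "yes", "yes"))
)

mapping <- list(
  mar  = list(`not married` = 0, married = 1),
  wexp = list(no = 0, yes = 1)
)

# Hypothetical helper: splices one column's codes into recode()
recode_with <- function(x, codes) recode(x, !!!codes)

# For every column named in mapping, look up its codes via cur_column()
recoded <- toy %>%
  mutate(across(all_of(names(mapping)),
                ~ recode_with(.x, mapping[[cur_column()]])))

recoded
```

Which variant reads better is a matter of taste. The expression-based version makes the generated calls inspectable, while this one stays closer to ordinary dplyr code.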

    Metaprogramming is a powerful technique, and I strongly recommend delving into the topic. Hadley Wickham's Advanced R book is a very good place to start.

    I hope this is what you were looking for.

    Update

    Diving deeper: how can this technique be applied to a tibble-column?

    Let's create a nested tibble with the columns group and df (the data to recode):

    rossi <- 
      head(Rossi, 5) %>% 
      as_tibble() %>% 
      select(mar, wexp)
    
    nested <- tibble(group = c("yes", "no"), df = list(rossi))
    

    nested looks like:

    # A tibble: 2 x 2
      group df              
      <chr> <list>          
    1 yes   <tibble [5 × 2]>
    2 no    <tibble [5 × 2]>
    

    We already know how to build a list of expressions from the list of codes. Let's create a function to handle it for us.

    build_recode_expressions <- function(list_of_codes) {
      imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
    }
    

    Here, the list_of_codes argument is a named list with one element per variable to recode; each element is itself a named list of codes.

    Assuming we have a list of several such sets of codes, we can transform it into a list of lists of expressions. The number of variables in each list may be arbitrary.

    codes <- list(
      yes = list(mar = list(`not married` = 0, married = 1)),
      no = list(
        mar = list(`not married` = 10, married = 20), 
        wexp = list(no = "NOOOO", yes = "YEEEES")
      )
    )
    
    exprs <- map(codes, build_recode_expressions)
    

    Now we can easily add exprs to the nested data frame as a new list-column.

    Another function may be useful for further work. It takes a data frame and a list of quoted expressions and returns a new data frame with the recoded columns.

    recode_df <- function(df, exprs) mutate(df, !!!exprs)
    

    It's time to put everything together. We have the tibble-column df, the list-column exprs, and the function recode_df, which combines one data frame with one list of expressions at a time.

    The key is the map2 function, which lets us iterate over two lists simultaneously:

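    nested %>% 
      # exprs is matched to rows by position, so codes must be in the same order as group
      mutate(exprs = exprs) %>% 
      mutate(df_recoded = map2(df, exprs, recode_df)) %>% 
      # tidyr < 1.0 syntax; tidyr >= 1.0 expects a single cols argument, e.g. unnest(c(df, df_recoded))
      unnest(df, df_recoded)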
    

    And this is the output:

    # A tibble: 10 x 5
       group mar         wexp   mar1 wexp1 
       <chr> <fct>       <fct> <dbl> <chr> 
     1 yes   not married no        0 no    
     2 yes   not married no        0 no    
     3 yes   not married yes       0 yes   
     4 yes   married     yes       1 yes   
     5 yes   not married yes       0 yes   
     6 no    not married no       10 NOOOO 
     7 no    not married no       10 NOOOO 
     8 no    not married yes      10 YEEEES
     9 no    married     yes      20 YEEEES
    10 no    not married yes      10 YEEEES
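A possible variation, sketched under a few assumptions: the expressions are looked up by group name rather than by position, and the df column is overwritten in place so that a single unnest(df) is enough. The toy data, the codes, and the name recode_exprs are hypothetical; both groups recode both variables to numbers here so that the pieces share column types when they are stacked.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rlang)

# Hypothetical toy data standing in for the Rossi subset
toy <- tibble(
  mar  = factor(c("not married", "married", "not married")),
  wexp = factor(c("no", "yes", "yes"))
)
nested <- tibble(group = c("yes", "no"), df = list(toy, toy))

# Deliberately listed in a different order than the rows of nested
codes <- list(
  no  = list(mar  = list(`not married` = 10, married = 20),
             wexp = list(no = 10, yes = 20)),
  yes = list(mar  = list(`not married` = 0, married = 1),
             wexp = list(no = 0, yes = 1))
)

build_recode_expressions <- function(list_of_codes) {
  imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
}
recode_df <- function(df, exprs) mutate(df, !!!exprs)

recode_exprs <- map(codes, build_recode_expressions)

result <- nested %>%
  # recode_exprs[group] picks each row's expressions by group name
  mutate(df = map2(df, recode_exprs[group], recode_df)) %>%
  unnest(df)

result
```

Looking the expressions up by name means the order of codes no longer has to match the order of the rows, which may matter once the nested tibble comes from real grouping rather than being built by hand.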
    

    I hope this update solves your problem.