r dplyr tidyeval nse

Using complex RHS expressions in `dplyr::case_when()` in the context of `dplyr::mutate()`


Problem

I'm trying to use dplyr::mutate() and dplyr::case_when() to create a new column of data in a dataframe that is populated using data stored in another object (a "lookup list") and is based on information in columns in the dataframe.

I know the answer probably has to do with (im)proper use of quasiquotation and NSE, but I'm having trouble extrapolating the information in the Programming with dplyr vignette to my situation.

I'm hoping that posting this reprex here can guide me to the right answer, and I think that solving this problem will go a long way towards helping me grok NSE.

Sample data

library(tidyverse)  # provides tibble(), mutate(), case_when(), pluck() and %>%

key_list <- list(
  "a" = list(
    foo = 1,
    bar = 2),
  "b" = list(
    foo = 3,
    bar = 4),
  "c" = list(
    foo = 5,
    bar = 6)
  )

x <- tibble(fruit = c("apple", "orange", "grape", "apple", "apple", "orange"),
            `Old Letter` = c("a", "a", "b", "c", "c", "c"),
            `Old Number` = c(9, 8, 7, 6, 5, 4)
            )

x

# # A tibble: 6 x 3
#   fruit  `Old Letter` `Old Number`
#   <chr>  <chr>               <dbl>
# 1 apple  a                       9
# 2 orange a                       8
# 3 grape  b                       7
# 4 apple  c                       6
# 5 apple  c                       5
# 6 orange c                       4

Goal

Specifically, I want to create a new column in x (which I will call `New Number`) that is populated based on the values in x$fruit and x$`Old Letter`.

Here's the code that mimics where I get hung up in my actual use case:

x %>% mutate(`New Number` = case_when(
  fruit == "apple" ~ pluck(key_list, `Old Letter`, "foo") * 10,
  fruit == "orange" ~ pluck(key_list, `Old Letter`, "foo") * 100,
  fruit == "grape" ~ pluck(key_list, `Old Letter`, "foo") * 1000
  ))

# Error: Index 1 must have length 1, not 6

Expected output

In my mind, I see the (desired) order of operations like this, for e.g. the first row of x:

  • fruit == "apple" is TRUE, so evaluate this expression: pluck(key_list, `Old Letter`, "foo") * 10
  • Since the value in the `Old Letter` column for this row is "a", the expression becomes pluck(key_list, "a", "foo") * 10 (which should operate on the key_list object in the global environment)
  • This simplifies to 1 * 10 (the "foo" entry for "a" is 1), which equals 10
  • Place the result of evaluating this expression into the `New Number` column.
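As a sanity check on the steps above, the single-row expressions can be evaluated directly. This is only a minimal sketch: it assumes key_list exactly as defined in the sample data, and it hard-codes the letters for rows 1 and 3 by hand.

```r
library(purrr)

key_list <- list(
  a = list(foo = 1, bar = 2),
  b = list(foo = 3, bar = 4),
  c = list(foo = 5, bar = 6)
)

# Row 1: fruit is "apple" and `Old Letter` is "a", so the index is the string "a".
row1 <- pluck(key_list, "a", "foo") * 10
row1
#> [1] 10

# Row 3: fruit is "grape" and `Old Letter` is "b".
row3 <- pluck(key_list, "b", "foo") * 1000
row3
#> [1] 3000
```

Each call here receives exactly one index per level, which is the situation pluck() is designed for.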

Extrapolating this to the entire command, I expected this as the output:

# # A tibble: 6 x 4
#   fruit  `Old Letter` `Old Number` `New Number`
#   <chr>  <chr>               <dbl>        <dbl>
# 1 apple  a                       9           10
# 2 orange a                       8          100
# 3 grape  b                       7         3000
# 4 apple  c                       6           50
# 5 apple  c                       5           50
# 6 orange c                       4          500

My take:

Judging from the error message, it looks like the entire `Old Letter` column is being passed to pluck() as a vector, rather than a single value from that column being used as the index. I'm guessing that this is because, according to the documentation for case_when():

case_when() is not a tidy eval function.

I've tried to track down how this is happening, but the stack trace didn't seem to point me anywhere helpful, and wrapping the whole command in either rlang::qq_show() or quo() didn't show me how R was interpreting the command with respect to NSE, because they both threw that same error as well.

I've experimented with combinations of:

  • quo(),
  • enquo(),
  • !!,
  • !!enquo() (shortened as {{}}), and
  • sym()

in the reprex code above, as well as wrapping the lookup in a function, but it throws the same error:

get_num <- function(x, y) purrr::pluck(key_list, x, y)

x %>% mutate(`New Number` = case_when(
  fruit == "apple" ~ get_num(`Old Letter`, "foo") * 10,
  fruit == "orange" ~ get_num(`Old Letter`, "foo") * 100,
  fruit == "grape" ~ get_num(`Old Letter`, "foo") * 1000
  ))

# Error: Index 1 must have length 1, not 6

This answer to another question on SO tells me that

I guess what you are missing about case_when() is that the arguments are evaluated at once, not per row.

But I'm not sure if/how that applies to my situation, so I'm at a loss.
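One way to see what that quote means in practice is a small experiment. The sketch below uses a made-up tibble and a hypothetical helper, n_seen(), which is not part of dplyr; it simply reports how many values it was handed in a single call.

```r
library(dplyr)

df <- tibble(g = c("a", "b", "a"), v = c(1, 2, 3))

# Hypothetical helper: returns the number of values it received, repeated
# so that the result has the same length as its input.
n_seen <- function(v) rep(as.numeric(length(v)), length(v))

out <- df %>%
  mutate(n = case_when(
    g == "a" ~ n_seen(v),  # called once with all 3 values, not once per row
    TRUE     ~ 0
  ))

out$n
#> [1] 3 0 3
```

The helper sees all three values of v even though only two rows match g == "a". In other words, the right-hand side is computed for the whole column first, and case_when() then picks the elements it needs.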

Anyways, thanks for any help you folks can provide!


sessionInfo():

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rlang_0.4.1     readxl_1.3.1    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3     readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
[10] ggplot2_3.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.6.0   base64enc_0.1-3  tools_3.6.0      digest_0.6.22    zeallot_0.1.0    evaluate_0.14   
[10] lubridate_1.7.4  jsonlite_1.6     lifecycle_0.1.0  nlme_3.1-141     gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.3  cli_1.1.0        rstudioapi_0.10 
[19] yaml_2.2.0       haven_2.1.1      xfun_0.10        withr_2.1.2      xml2_1.2.2       httr_1.4.1       knitr_1.25       generics_0.0.2   vctrs_0.2.0     
[28] hms_0.5.1        grid_3.6.0       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         fansi_0.4.0      rmarkdown_1.16   modelr_0.1.5     magrittr_1.5    
[37] htmltools_0.4.0  backports_1.1.5  scales_1.0.0     rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4       stringi_1.4.3    lazyeval_0.2.2  
[46] munsell_0.5.0    broom_0.5.2      crayon_1.3.4 

Solution

  • I think the issue may have less to do with NSE than with the fact that pluck is not vectorized: as currently written, pluck is not evaluated once per row; instead, the whole `Old Letter` column is passed to pluck in a single call. However, as you figured out, pluck needs each index to be a single value (length 1), not a vector.

    One way to solve this is to map the function across the rows, using your code as a little lambda-style function. Note that you need to use map_dbl so that the result is a numeric (double) vector, otherwise map would return a list and everything would explode :-)

    x %>% 
      mutate(`New Number` = case_when(
        fruit == "apple" ~ map_dbl(`Old Letter`, ~ pluck(key_list, ., "foo")) * 10,
        fruit == "orange" ~ map_dbl(`Old Letter`, ~ pluck(key_list, ., "foo")) * 100,
        fruit == "grape" ~ map_dbl(`Old Letter`, ~ pluck(key_list, ., "foo")) * 1000
      ))
    
    # # A tibble: 6 x 4
    #   fruit  `Old Letter` `Old Number` `New Number`
    #   <chr>  <chr>               <dbl>        <dbl>
    # 1 apple  a                       9           10
    # 2 orange a                       8          100
    # 3 grape  b                       7         3000
    # 4 apple  c                       6           50
    # 5 apple  c                       5           50
    # 6 orange c                       4          500
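An alternative that avoids calling pluck once per element is to flatten the lookup into a named vector and index it with the whole column. This is only a sketch of that idea, not part of the approach above: the names foo_lookup and multiplier are hypothetical, and the multipliers are simply copied from the three case_when() branches.

```r
library(dplyr)
library(purrr)

key_list <- list(
  a = list(foo = 1, bar = 2),
  b = list(foo = 3, bar = 4),
  c = list(foo = 5, bar = 6)
)

x <- tibble(
  fruit = c("apple", "orange", "grape", "apple", "apple", "orange"),
  `Old Letter` = c("a", "a", "b", "c", "c", "c"),
  `Old Number` = c(9, 8, 7, 6, 5, 4)
)

# Extract "foo" from every entry once: a named double vector c(a = 1, b = 3, c = 5).
foo_lookup <- map_dbl(key_list, "foo")

# Hypothetical per-fruit multipliers, copied from the case_when() branches.
multiplier <- c(apple = 10, orange = 100, grape = 1000)

# Indexing a named vector by a character vector is vectorized in base R.
result <- x %>%
  mutate(`New Number` = unname(foo_lookup[`Old Letter`] * multiplier[fruit]))

result$`New Number`
#> [1]   10  100 3000   50   50  500
```

This gives the same values as the map_dbl version for the sample data. A fruit or letter that is missing from the lookup vectors would produce NA rather than an error, so it may be worth checking for that in real data.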