I'm trying to use dplyr::mutate() and dplyr::case_when() to create a new column in a data frame that is populated using data stored in another object (a "lookup list"), based on information in the data frame's columns.
I know the answer probably has to do with (im)proper use of quasiquotation and NSE, but I'm having trouble extrapolating the information in the Programming with dplyr vignette to my situation.
I'm hoping that posting this reprex here can guide me to the right answer, and I think that solving this problem will go a long way towards helping me grok NSE.
library(tidyverse)

# Lookup list: one entry per letter, each holding a foo and a bar value
key_list <- list(
  "a" = list(
    foo = 1,
    bar = 2),
  "b" = list(
    foo = 3,
    bar = 4),
  "c" = list(
    foo = 5,
    bar = 6)
)
x <- tibble(fruit = c("apple", "orange", "grape", "apple", "apple", "orange"),
            `Old Letter` = c("a", "a", "b", "c", "c", "c"),
            `Old Number` = c(9, 8, 7, 6, 5, 4)
)
x
# # A tibble: 6 x 3
# fruit `Old Letter` `Old Number`
# <chr> <chr> <dbl>
# 1 apple a 9
# 2 orange a 8
# 3 grape b 7
# 4 apple c 6
# 5 apple c 5
# 6 orange c 4
Specifically, I want to create a new column in x (which I will call `New Number`) that is populated based on the values in x$fruit and x$`Old Letter`.
Here's the code that mimics where I get hung up in my actual use case:
x %>% mutate(`New Number` = case_when(
  fruit == "apple" ~ pluck(key_list, `Old Letter`, "foo") * 10,
  fruit == "orange" ~ pluck(key_list, `Old Letter`, "foo") * 100,
  fruit == "grape" ~ pluck(key_list, `Old Letter`, "foo") * 1000
))
# Error: Index 1 must have length 1, not 6
In my mind, the (desired) order of operations looks like this for, e.g., the first row of x:
1. fruit == "apple" is TRUE, so evaluate this expression: pluck(key_list, `Old Letter`, "foo") * 10
2. Because the `Old Letter` column for this row is "a", the expression becomes pluck(key_list, "a", "foo") * 10 (which should operate on the key_list object in the global environment)
3. That evaluates to 1 * 10, which equals 10
4. 10 is stored in the `New Number` column.
Extrapolating this to the entire command, I expected this as the output:
# # A tibble: 6 x 4
# fruit `Old Letter` `Old Number` `New Number`
# <chr> <chr> <dbl> <dbl>
# 1 apple a 9 10
# 2 orange a 8 100
# 3 grape b 7 3000
# 4 apple c 6 50
# 5 apple c 5 50
# 6 orange c 4 500
Judging from the error message I'm getting, it looks like the entire `Old Letter` column is being passed to pluck() as a vector, rather than a single value from that column being used as the index. I'm guessing that this is because, according to the documentation for case_when():
"case_when() is not a tidy eval function."
I've tried to track down how this is happening, but the stack trace didn't seem to point me anywhere helpful, and wrapping the whole command in either rlang::qq_show() or quo() didn't show me how R was interpreting the command with respect to NSE, because they both threw that same error as well.
I've experimented with combinations of quo(), enquo(), !!, !!enquo() (shortened as {{}}), and sym() in the reprex code above, as well as wrapping it up into a function, but it throws the same error:
get_num <- function(x, y) purrr::pluck(key_list, x, y)
x %>% mutate(`New Number` = case_when(
  fruit == "apple" ~ get_num(`Old Letter`, "foo") * 10,
  fruit == "orange" ~ get_num(`Old Letter`, "foo") * 100,
  fruit == "grape" ~ get_num(`Old Letter`, "foo") * 1000
))
# Error: Index 1 must have length 1, not 6
This answer to another question on SO tells me that:
"I guess what you are missing about case_when() is that the arguments are evaluated at once, not per row."
But I'm not sure if/how that applies to my situation, so I'm at a loss.
Anyways, thanks for any help you folks can provide!
sessionInfo():
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.15
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rlang_0.4.1 readxl_1.3.1 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3 readr_1.3.1 tidyr_1.0.0 tibble_2.1.3
[10] ggplot2_3.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.0 base64enc_0.1-3 tools_3.6.0 digest_0.6.22 zeallot_0.1.0 evaluate_0.14
[10] lubridate_1.7.4 jsonlite_1.6 lifecycle_0.1.0 nlme_3.1-141 gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.3 cli_1.1.0 rstudioapi_0.10
[19] yaml_2.2.0 haven_2.1.1 xfun_0.10 withr_2.1.2 xml2_1.2.2 httr_1.4.1 knitr_1.25 generics_0.0.2 vctrs_0.2.0
[28] hms_0.5.1 grid_3.6.0 tidyselect_0.2.5 glue_1.3.1 R6_2.4.0 fansi_0.4.0 rmarkdown_1.16 modelr_0.1.5 magrittr_1.5
[37] htmltools_0.4.0 backports_1.1.5 scales_1.0.0 rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2
[46] munsell_0.5.0 broom_0.5.2 crayon_1.3.4
I think the issue may have less to do with NSE than with the fact that pluck is not vectorized: as currently written, pluck is not evaluated once per row; instead, the whole `Old Letter` column is passed to pluck in a single call. However, as you figured out, pluck needs a single length-1 index at each position, not a vector.
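To make the point about vectorization concrete, here is a minimal sketch using the key_list from the question (rebuilt here so the snippet runs on its own). The variable name res and the "error" placeholder string are just illustrative choices, not anything from the original post.

```r
library(purrr)

# Same lookup list as in the question
key_list <- list(
  a = list(foo = 1, bar = 2),
  b = list(foo = 3, bar = 4),
  c = list(foo = 5, bar = 6)
)

# A single (length-1) index at each level works
single <- pluck(key_list, "a", "foo")
single

# A longer vector used as one index fails; tryCatch() turns the
# error into a placeholder string so the script keeps running
res <- tryCatch(
  pluck(key_list, c("a", "b"), "foo"),
  error = function(e) "error"
)
res

# Mapping over the letters calls pluck() once per element instead
mapped <- map_dbl(c("a", "b", "c"), ~ pluck(key_list, .x, "foo"))
mapped
```

Inside mutate(), `Old Letter` is the full column, so the failing call in the middle is what happens in the original code.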
One way to solve this is to map the function across the rows, using your code as a little lambda-style function. Note that you need to use map_dbl to get a numeric vector back; otherwise map would return a list and everything would explode :-)
x %>%
  mutate(`New Number` = case_when(
    fruit == "apple" ~ map_dbl(`Old Letter`, ~ pluck(key_list, ., "foo")) * 10,
    fruit == "orange" ~ map_dbl(`Old Letter`, ~ pluck(key_list, ., "foo")) * 100,
    fruit == "grape" ~ map_dbl(`Old Letter`, ~ pluck(key_list, ., "foo")) * 1000
  ))
# # A tibble: 6 x 4
# fruit `Old Letter` `Old Number` `New Number`
# <chr> <chr> <dbl> <dbl>
# 1 apple a 9 10
# 2 orange a 8 100
# 3 grape b 7 3000
# 4 apple c 6 50
# 5 apple c 5 50
# 6 orange c 4 500
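Since all three branches look up the same "foo" value and differ only in the multiplier, another possible approach (a sketch, not necessarily better for your real use case) is to flatten the lookup list into a named vector once and index it with the whole column. The names foo_by_letter and result are hypothetical helpers introduced just for this example.

```r
library(dplyr)
library(purrr)
library(tibble)

key_list <- list(
  a = list(foo = 1, bar = 2),
  b = list(foo = 3, bar = 4),
  c = list(foo = 5, bar = 6)
)
x <- tibble(
  fruit = c("apple", "orange", "grape", "apple", "apple", "orange"),
  `Old Letter` = c("a", "a", "b", "c", "c", "c"),
  `Old Number` = c(9, 8, 7, 6, 5, 4)
)

# Extract "foo" from every entry; the result is a numeric vector
# named by letter, and indexing it by name is vectorized
foo_by_letter <- map_dbl(key_list, "foo")

result <- x %>%
  mutate(`New Number` = unname(foo_by_letter[`Old Letter`]) * case_when(
    fruit == "apple" ~ 10,
    fruit == "orange" ~ 100,
    fruit == "grape" ~ 1000
  ))
result
```

This does the list lookup once, outside case_when(), so case_when() only has to pick the multiplier for each row.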