
Dynamically construct function calls with varying arguments using dplyr and NSE


I want to be able to construct function calls dynamically with varying grouping variables/arguments using dplyr. The number of function calls may be quite large, which means the examples in the programming with dplyr vignette are not practical. Ideally I want to be able to construct an object (e.g. a list) beforehand which stores the arguments/variables to be passed in each function call. Below is an example dataset, where we want to apply some summarising functions based on changing grouping variables.

set.seed(1)
df <- data.frame(values = sample(x = 1:10, size = 10),
                 grouping_var1 = sample(x = letters[1:2], size = 10, replace = TRUE),
                 grouping_var2 = sample(x = letters[24:26], size = 10, replace = TRUE),
                 grouping_var3 = sample(x = LETTERS[1:2], size = 10, replace = TRUE))

> df
   values grouping_var1 grouping_var2 grouping_var3
1       9             a             x             B
2       4             a             z             B
3       7             a             x             A
4       1             a             x             B
5       2             a             x             A
6       5             b             x             A
7       3             b             y             B
8      10             b             x             A
9       6             b             x             B
10      8             a             y             B

Following the programming with dplyr vignette we could come up with a solution like this:

library(dplyr)

f <- function(df, ...){
  group_var <- enquos(...)

  df %>%
    group_by(!!! group_var) %>%
    summarise_at(.vars = "values", .funs = sum) %>%
    print(n = 10)
}

> f(df, grouping_var1)
# A tibble: 2 x 2
  grouping_var1 values
  <fct>          <int>
1 a                 31
2 b                 24

> f(df, grouping_var1, grouping_var2)
# A tibble: 5 x 3
# Groups:   grouping_var1 [2]
  grouping_var1 grouping_var2 values
  <fct>         <fct>          <int>
1 a             x                 19
2 a             y                  8
3 a             z                  4
4 b             x                 21
5 b             y                  3

The example above is impractical and inflexible if I want to construct a large number of calls. Another limitation is that other information I may wish to include cannot easily be passed together with, or in addition to, the grouping variables.

Assume I have a list containing grouping variables I want to pass in each function call. Assume also for each of those list elements there is a separate field with an "id" describing the grouping which was performed. See below for an example:

list(group_vars = list(c("grouping_var1"),
                       c("grouping_var1", "grouping_var2"),
                       c("grouping_var1", "grouping_var3")),
     group_ids = list("var_1",
                      c("var_1_2"),
                      c("var_1_3")))

How do I dynamically pass these lists of arguments/variables and ids to function calls and have them be successfully evaluated using dplyr? Let's say I want the resulting data frame to contain, alongside the summarised data, a column holding the group_ids. For example, if group_vars were c("grouping_var1", "grouping_var2") and group_ids were "var_1_2" for a specific function call, I would expect the output:

# A tibble: 5 x 4
# Groups:   grouping_var1 [2]
  grouping_var1 grouping_var2 values group_ids
  <fct>         <fct>          <int> <chr>    
1 a             x                 19 var_1_2  
2 a             y                  8 var_1_2  
3 a             z                  4 var_1_2  
4 b             x                 21 var_1_2  
5 b             y                  3 var_1_2 

I am hoping to see a solution implementing this without using the nowadays deprecated group_by_ functions which accept strings.

On an ending note, I feel it is rather discouraging that programming with dplyr in functions using NSE has such a barrier to entry. Anytime I get stuck with something that should be simple it takes hours to find a solution.


Solution

  • I'm not sure what the "standard" tidyverse approach is here, as I never really have a sense of whether I'm "doing it right" when I try to write generalized tidyverse functions for my typical workflows, but here's another approach.*

    First, we can generate a list of combinations of grouping columns, rather than hard-coding them. In this case, the list includes all possible combinations of 1, 2, or 3 grouping columns, but that can be pared back as needed.

    library(tidyverse)
    
    # Generate a list of combinations of grouping variables.
    groups.list = map(1:3, ~combn(names(df)[map_lgl(df, ~!is.numeric(.))], .x, simplify=FALSE)) %>% 
      flatten
    

    Below is a summary function that uses group_by_at, which accepts column names as strings, so there's no need for non-standard evaluation. (group_by_at is a scoped verb, not the deprecated underscore-suffixed group_by_.) In addition, we get the group.ids values from group_vars itself, so we don't need a separate parameter or argument (though this may need to be tweaked, depending on what you expect for the names of the grouping columns).

    # Summarise for each combination of groups
    # Generate group.ids from group_vars itself
    f2 <- function(data, group_vars) {
    
      data %>%
        group_by_at(group_vars) %>%
        summarise(values=sum(values)) %>% 
        mutate(group.ids=paste0("var_", paste(str_extract(group_vars, "[0-9]"), collapse="_")))
    
      }
    
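As of dplyr 1.0.0 the scoped verbs such as group_by_at are superseded, and the same string-based grouping can be written with across(all_of()). The following is only a sketch of that variant, not the code from the answer above: f3, group_spec and the explicit group_id argument are names made up for illustration, and it assumes dplyr >= 1.0.0 and purrr are installed. It takes the ids from the question's list instead of deriving them from the column names.

```r
library(dplyr)
library(purrr)

# Example data from the question
set.seed(1)
df <- data.frame(
  values = sample(x = 1:10, size = 10),
  grouping_var1 = sample(x = letters[1:2], size = 10, replace = TRUE),
  grouping_var2 = sample(x = letters[24:26], size = 10, replace = TRUE),
  grouping_var3 = sample(x = LETTERS[1:2], size = 10, replace = TRUE)
)

# The grouping variables and ids from the question, stored up front
group_spec <- list(
  group_vars = list("grouping_var1",
                    c("grouping_var1", "grouping_var2"),
                    c("grouping_var1", "grouping_var3")),
  group_ids = list("var_1", "var_1_2", "var_1_3")
)

# Hypothetical helper: group by a character vector of column names
# and tag every row of the summary with the supplied id
f3 <- function(data, group_vars, group_id) {
  data %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(values = sum(values), .groups = "drop") %>%
    mutate(group_ids = group_id)
}

# One call per (group_vars, group_id) pair
results <- map2(group_spec$group_vars, group_spec$group_ids,
                ~ f3(df, .x, .y))
```

Passing the id explicitly keeps the function independent of any naming scheme for the grouping columns, at the cost of maintaining the two lists in parallel.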

    Now we can run the function on every element of groups.list

    map(groups.list, ~f2(df, .x))
    
    [[1]]
    # A tibble: 2 x 3
      grouping_var1 values group.ids
      <fct>          <int> <chr>    
    1 a                 31 var_1    
    2 b                 24 var_1    
    
    [[2]]
    # A tibble: 3 x 3
      grouping_var2 values group.ids
      <fct>          <int> <chr>    
    1 x                 40 var_2    
    2 y                 11 var_2    
    3 z                  4 var_2    
    
    [[3]]
    # A tibble: 2 x 3
      grouping_var3 values group.ids
      <fct>          <int> <chr>    
    1 A                 24 var_3    
    2 B                 31 var_3    
    
    [[4]]
    # A tibble: 5 x 4
    # Groups:   grouping_var1 [2]
      grouping_var1 grouping_var2 values group.ids
      <fct>         <fct>          <int> <chr>    
    1 a             x                 19 var_1_2  
    2 a             y                  8 var_1_2  
    3 a             z                  4 var_1_2  
    4 b             x                 21 var_1_2  
    5 b             y                  3 var_1_2  
    
    [[5]]
    # A tibble: 4 x 4
    # Groups:   grouping_var1 [2]
      grouping_var1 grouping_var3 values group.ids
      <fct>         <fct>          <int> <chr>    
    1 a             A                  9 var_1_3  
    2 a             B                 22 var_1_3  
    3 b             A                 15 var_1_3  
    4 b             B                  9 var_1_3  
    
    [[6]]
    # A tibble: 4 x 4
    # Groups:   grouping_var2 [3]
      grouping_var2 grouping_var3 values group.ids
      <fct>         <fct>          <int> <chr>    
    1 x             A                 24 var_2_3  
    2 x             B                 16 var_2_3  
    3 y             B                 11 var_2_3  
    4 z             B                  4 var_2_3  
    
    [[7]]
    # A tibble: 7 x 5
    # Groups:   grouping_var1, grouping_var2 [5]
      grouping_var1 grouping_var2 grouping_var3 values group.ids
      <fct>         <fct>         <fct>          <int> <chr>    
    1 a             x             A                  9 var_1_2_3
    2 a             x             B                 10 var_1_2_3
    3 a             y             B                  8 var_1_2_3
    4 a             z             B                  4 var_1_2_3
    5 b             x             A                 15 var_1_2_3
    6 b             x             B                  6 var_1_2_3
    7 b             y             B                  3 var_1_2_3
    
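If the tidy-eval style from the question is preferred over string-accepting verbs, a character vector of column names can be turned into symbols with rlang::syms() and spliced into group_by() with !!!. This is a minimal sketch under that assumption; f_syms is a hypothetical name and is not part of the answer above.

```r
library(dplyr)
library(rlang)

# Example data from the question
set.seed(1)
df <- data.frame(
  values = sample(x = 1:10, size = 10),
  grouping_var1 = sample(x = letters[1:2], size = 10, replace = TRUE),
  grouping_var2 = sample(x = letters[24:26], size = 10, replace = TRUE),
  grouping_var3 = sample(x = LETTERS[1:2], size = 10, replace = TRUE)
)

# Hypothetical helper: convert strings to symbols, then splice them
# into group_by() as if they had been typed unquoted
f_syms <- function(data, group_vars) {
  data %>%
    group_by(!!!syms(group_vars)) %>%
    summarise(values = sum(values), .groups = "drop")
}

by_1_2 <- f_syms(df, c("grouping_var1", "grouping_var2"))
```

The result has the same columns as f(df, grouping_var1, grouping_var2) in the question, but the grouping columns arrive as data rather than as code, so they can be stored in a list beforehand.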

    Or, if you want to combine all of the results, you could do something like this:

    map(groups.list, ~f2(df, .x)) %>% 
      bind_rows() %>% 
      mutate_if(is.factor, fct_explicit_na, na_level="All") %>% 
      select(group.ids, matches("grouping"), values)
    
       group.ids grouping_var1 grouping_var2 grouping_var3 values
       <chr>     <fct>         <fct>         <fct>          <int>
     1 var_1     a             All           All               31
     2 var_1     b             All           All               24
     3 var_2     All           x             All               40
     4 var_2     All           y             All               11
     5 var_2     All           z             All                4
     6 var_3     All           All           A                 24
     7 var_3     All           All           B                 31
     8 var_1_2   a             x             All               19
     9 var_1_2   a             y             All                8
    10 var_1_2   a             z             All                4
    11 var_1_2   b             x             All               21
    12 var_1_2   b             y             All                3
    13 var_1_3   a             All           A                  9
    14 var_1_3   a             All           B                 22
    15 var_1_3   b             All           A                 15
    16 var_1_3   b             All           B                  9
    17 var_2_3   All           x             A                 24
    18 var_2_3   All           x             B                 16
    19 var_2_3   All           y             B                 11
    20 var_2_3   All           z             B                  4
    21 var_1_2_3 a             x             A                  9
    22 var_1_2_3 a             x             B                 10
    23 var_1_2_3 a             y             B                  8
    24 var_1_2_3 a             z             B                  4
    25 var_1_2_3 b             x             A                 15
    26 var_1_2_3 b             x             B                  6
    27 var_1_2_3 b             y             B                  3
    
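Note that fct_explicit_na is deprecated as of forcats 1.0.0. One way to get the same "All" placeholder without any factor helpers is to convert the grouping columns to character and fill the gaps with coalesce(). The sketch below assumes dplyr >= 1.0.0; summarise_groups, group_cols and combined are illustrative names, and the grouping columns come out as character rather than factor.

```r
library(dplyr)
library(purrr)

# Example data from the question
set.seed(1)
df <- data.frame(
  values = sample(x = 1:10, size = 10),
  grouping_var1 = sample(x = letters[1:2], size = 10, replace = TRUE),
  grouping_var2 = sample(x = letters[24:26], size = 10, replace = TRUE),
  grouping_var3 = sample(x = LETTERS[1:2], size = 10, replace = TRUE)
)

# All non-numeric columns are treated as grouping columns
group_cols <- names(df)[!map_lgl(df, is.numeric)]

# Every combination of 1, 2, ..., length(group_cols) grouping columns
groups.list <- unlist(
  lapply(seq_along(group_cols),
         function(k) combn(group_cols, k, simplify = FALSE)),
  recursive = FALSE
)

# Hypothetical helper: summarise and derive the id from the digits
# in the column names (e.g. "grouping_var1" -> "1")
summarise_groups <- function(data, group_vars) {
  id <- paste0("var_", paste(gsub("[^0-9]", "", group_vars), collapse = "_"))
  data %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(values = sum(values), .groups = "drop") %>%
    mutate(group.ids = id)
}

# Stack the summaries; columns a summary did not group by are NA
# after bind_rows() and are replaced with "All"
combined <- map(groups.list, ~ summarise_groups(df, .x)) %>%
  bind_rows() %>%
  mutate(across(all_of(group_cols), ~ coalesce(as.character(.x), "All"))) %>%
  select(group.ids, all_of(group_cols), values)
```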
    * This question was cross-posted to RStudio Community and I've added this answer there as well.