I want to be able to construct function calls dynamically with varying grouping variables/arguments using dplyr. The number of function calls may be quite large, which means the examples in the programming with dplyr vignette are not practical. Ideally I want to be able to construct an object (e.g. a list) beforehand which stores the arguments/variables to be passed in each function call. Below is an example dataset, where we want to apply some summarising functions based on changing grouping variables.
set.seed(1)
df <- data.frame(values = sample(x = 1:10, size = 10),
                 grouping_var1 = sample(x = letters[1:2], size = 10, replace = TRUE),
                 grouping_var2 = sample(x = letters[24:26], size = 10, replace = TRUE),
                 grouping_var3 = sample(x = LETTERS[1:2], size = 10, replace = TRUE))
> df
   values grouping_var1 grouping_var2 grouping_var3
1       9             a             x             B
2       4             a             z             B
3       7             a             x             A
4       1             a             x             B
5       2             a             x             A
6       5             b             x             A
7       3             b             y             B
8      10             b             x             A
9       6             b             x             B
10      8             a             y             B
Following the programming with dplyr vignette, we could come up with a solution like this:
library(dplyr)

f <- function(df, ...){
  # Capture the unquoted grouping columns passed through ...
  group_var <- enquos(...)
  df %>%
    group_by(!!! group_var) %>%
    summarise_at(.vars = "values", .funs = sum) %>%
    print(n = 10)
}
> f(df, grouping_var1)
# A tibble: 2 x 2
  grouping_var1 values
  <fct>          <int>
1 a                 31
2 b                 24
> f(df, grouping_var1, grouping_var2)
# A tibble: 5 x 3
# Groups:   grouping_var1 [2]
  grouping_var1 grouping_var2 values
  <fct>         <fct>          <int>
1 a             x                 19
2 a             y                  8
3 a             z                  4
4 b             x                 21
5 b             y                  3
The example above is impractical and inflexible if I want to construct a large number of calls. Another limitation is that other information I may wish to include cannot easily be passed together or in addition to the grouping variables.
Assume I have a list containing grouping variables I want to pass in each function call. Assume also for each of those list elements there is a separate field with an "id" describing the grouping which was performed. See below for an example:
list(group_vars = list(c("grouping_var1"),
                       c("grouping_var1", "grouping_var2"),
                       c("grouping_var1", "grouping_var3")),
     group_ids = list("var_1",
                      c("var_1_2"),
                      c("var_1_3")))
How do I dynamically pass these lists of arguments/variables and ids to function calls and have them be successfully evaluated using dplyr? Let's say I want the resulting dataframe to contain, in addition to the summarised data, a column holding the group_ids. For example, if my group_vars were c("grouping_var1", "grouping_var2") and the group_ids was "var_1_2" for a specific function call, I would expect the output:
# A tibble: 5 x 4
# Groups:   grouping_var1 [2]
  grouping_var1 grouping_var2 values group_ids
  <fct>         <fct>          <int> <chr>
1 a             x                 19 var_1_2
2 a             y                  8 var_1_2
3 a             z                  4 var_1_2
4 b             x                 21 var_1_2
5 b             y                  3 var_1_2
I am hoping to see a solution that does this without the now-deprecated group_by_ functions, which accept strings.
On an ending note, I feel it is rather discouraging that programming with dplyr in functions using NSE has such a barrier to entry. Anytime I get stuck with something that should be simple it takes hours to find a solution.
I'm not sure what the "standard" tidyverse approach is here, since I never really know whether I'm "doing it right" when I write generalized tidyverse functions for my typical workflows, but here's another approach.
First, we can generate a list of combinations of grouping columns, rather than hard-coding them. In this case, the list includes all possible combinations of 1, 2, or 3 grouping columns, but that can be pared back as needed.
library(tidyverse)

# Generate a list of combinations of grouping variables.
groups.list = map(1:3, ~combn(names(df)[map_lgl(df, ~!is.numeric(.))], .x, simplify=FALSE)) %>%
  flatten
Below is a summary function that uses group_by_at, which can take strings, so there's no need for non-standard evaluation. In addition, we get the group.ids values from group_vars itself, so we don't need a separate parameter or argument (though this may need to be tweaked, depending on what you expect for the names of the grouping columns).
# Summarise for each combination of groups
# Generate group.ids from group_vars itself
f2 <- function(data, group_vars) {
  data %>%
    group_by_at(group_vars) %>%
    summarise(values=sum(values)) %>%
    mutate(group.ids=paste0("var_", paste(str_extract(group_vars, "[0-9]"), collapse="_")))
}
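For the case in the question, where the ids come from a separate list rather than being derived from the column names, one possible sketch is to convert the strings to symbols with rlang::syms(), splice them into group_by(), and walk both lists in parallel with purrr::map2(). The names f3 and spec are made up for illustration, and the data frame is typed in from the printed df in the question so the numbers don't depend on the random seed.

```r
library(dplyr)
library(purrr)

# Same rows as the printed df in the question, entered explicitly.
df <- data.frame(
  values        = c(9L, 4L, 7L, 1L, 2L, 5L, 3L, 10L, 6L, 8L),
  grouping_var1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "a"),
  grouping_var2 = c("x", "z", "x", "x", "x", "x", "y", "x", "x", "y"),
  grouping_var3 = c("B", "B", "A", "B", "A", "A", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# The specification list from the question (spec is a hypothetical name).
spec <- list(
  group_vars = list("grouping_var1",
                    c("grouping_var1", "grouping_var2"),
                    c("grouping_var1", "grouping_var3")),
  group_ids  = list("var_1", "var_1_2", "var_1_3")
)

# f3 (hypothetical): group by a character vector of column names and
# tag every row of the summary with a caller-supplied id.
f3 <- function(data, group_vars, group_id) {
  data %>%
    group_by(!!!rlang::syms(group_vars)) %>%  # strings -> symbols, then splice
    summarise(values = sum(values)) %>%
    ungroup() %>%
    mutate(group_ids = group_id)
}

# One summary per (group_vars, group_ids) pair.
results <- map2(spec$group_vars, spec$group_ids, ~ f3(df, .x, .y))
results[[2]]
```

Here results[[2]] should hold the five rows shown in the question's expected output, except that the result is ungrouped and the grouping columns are character rather than factor.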
Now we can run the function on every element of groups.list:
map(groups.list, ~f2(df, .x))
[[1]]
# A tibble: 2 x 3
  grouping_var1 values group.ids
  <fct>          <int> <chr>
1 a                 31 var_1
2 b                 24 var_1

[[2]]
# A tibble: 3 x 3
  grouping_var2 values group.ids
  <fct>          <int> <chr>
1 x                 40 var_2
2 y                 11 var_2
3 z                  4 var_2

[[3]]
# A tibble: 2 x 3
  grouping_var3 values group.ids
  <fct>          <int> <chr>
1 A                 24 var_3
2 B                 31 var_3

[[4]]
# A tibble: 5 x 4
# Groups:   grouping_var1 [2]
  grouping_var1 grouping_var2 values group.ids
  <fct>         <fct>          <int> <chr>
1 a             x                 19 var_1_2
2 a             y                  8 var_1_2
3 a             z                  4 var_1_2
4 b             x                 21 var_1_2
5 b             y                  3 var_1_2

[[5]]
# A tibble: 4 x 4
# Groups:   grouping_var1 [2]
  grouping_var1 grouping_var3 values group.ids
  <fct>         <fct>          <int> <chr>
1 a             A                  9 var_1_3
2 a             B                 22 var_1_3
3 b             A                 15 var_1_3
4 b             B                  9 var_1_3

[[6]]
# A tibble: 4 x 4
# Groups:   grouping_var2 [3]
  grouping_var2 grouping_var3 values group.ids
  <fct>         <fct>          <int> <chr>
1 x             A                 24 var_2_3
2 x             B                 16 var_2_3
3 y             B                 11 var_2_3
4 z             B                  4 var_2_3

[[7]]
# A tibble: 7 x 5
# Groups:   grouping_var1, grouping_var2 [5]
  grouping_var1 grouping_var2 grouping_var3 values group.ids
  <fct>         <fct>         <fct>          <int> <chr>
1 a             x             A                  9 var_1_2_3
2 a             x             B                 10 var_1_2_3
3 a             y             B                  8 var_1_2_3
4 a             z             B                  4 var_1_2_3
5 b             x             A                 15 var_1_2_3
6 b             x             B                  6 var_1_2_3
7 b             y             B                  3 var_1_2_3
Or, if you want to combine all of the results, you could do something like this:
map(groups.list, ~f2(df, .x)) %>%
  bind_rows() %>%
  mutate_if(is.factor, fct_explicit_na, na_level="All") %>%
  select(group.ids, matches("grouping"), values)
   group.ids grouping_var1 grouping_var2 grouping_var3 values
   <chr>     <fct>         <fct>         <fct>          <int>
 1 var_1     a             All           All               31
 2 var_1     b             All           All               24
 3 var_2     All           x             All               40
 4 var_2     All           y             All               11
 5 var_2     All           z             All                4
 6 var_3     All           All           A                 24
 7 var_3     All           All           B                 31
 8 var_1_2   a             x             All               19
 9 var_1_2   a             y             All                8
10 var_1_2   a             z             All                4
11 var_1_2   b             x             All               21
12 var_1_2   b             y             All                3
13 var_1_3   a             All           A                  9
14 var_1_3   a             All           B                 22
15 var_1_3   b             All           A                 15
16 var_1_3   b             All           B                  9
17 var_2_3   All           x             A                 24
18 var_2_3   All           x             B                 16
19 var_2_3   All           y             B                 11
20 var_2_3   All           z             B                  4
21 var_1_2_3 a             x             A                  9
22 var_1_2_3 a             x             B                 10
23 var_1_2_3 a             y             B                  8
24 var_1_2_3 a             z             B                  4
25 var_1_2_3 b             x             A                 15
26 var_1_2_3 b             x             B                  6
27 var_1_2_3 b             y             B                  3
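With dplyr 1.0.0 or later (an assumption; the code above was written against an earlier version), the same summarise-and-combine pipeline can probably be sketched with across(all_of(...)) in place of group_by_at, and coalesce() in place of fct_explicit_na. The helper name summarise_by is hypothetical, only two grouping sets are used to keep it short, and the grouping columns here are plain character vectors rather than factors.

```r
library(dplyr)
library(purrr)

# Same rows as the printed df in the question, entered explicitly.
df <- data.frame(
  values        = c(9L, 4L, 7L, 1L, 2L, 5L, 3L, 10L, 6L, 8L),
  grouping_var1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "a"),
  grouping_var2 = c("x", "z", "x", "x", "x", "x", "y", "x", "x", "y"),
  grouping_var3 = c("B", "B", "A", "B", "A", "A", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Two of the grouping sets, as character vectors of column names.
group_sets <- list("grouping_var1",
                   c("grouping_var1", "grouping_var2"))

# summarise_by (hypothetical): group by string column names via across(),
# drop the grouping afterwards, and build the id from the digits in the names.
summarise_by <- function(data, group_vars) {
  data %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(values = sum(values), .groups = "drop") %>%
    mutate(group.ids = paste0("var_", paste(gsub("\\D", "", group_vars), collapse = "_")))
}

combined <- map(group_sets, ~ summarise_by(df, .x)) %>%
  bind_rows() %>%
  # Grouping columns absent from a given summary come back as NA; label them "All".
  mutate(across(starts_with("grouping"), ~ coalesce(.x, "All"))) %>%
  select(group.ids, starts_with("grouping"), values)

combined
```

Dropping the groups inside the helper avoids the leftover "# Groups:" attribute seen in the per-element output, which is one reason to prefer .groups = "drop" when the results are going to be stacked anyway.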