How to make a user-defined function work nicely with pipes and group_by? Here is a simple function:
library(tidyverse)
fun_head <- function(df, column) {
column <- enquo(column)
df %>% select(!!column) %>% head(1)
}
The function works nicely with pipes and can be combined with a filter on another column:
mtcars %>% filter(cyl == 4) %>% fun_head(mpg)
> mpg
1 22.8
However, the same pipeline does not apply the function per group when used with group_by:
mtcars %>% group_by(cyl) %>% fun_head(mpg)
Adding missing grouping variables: `cyl`
# A tibble: 1 x 2
# Groups: cyl [1]
cyl mpg
<dbl> <dbl>
1 6 21
Using "do" after group_by makes it work:
> mtcars %>% group_by(cyl) %>% do(fun_head(., mpg))
# A tibble: 3 x 2
# Groups: cyl [3]
cyl mpg
<dbl> <dbl>
1 4 22.8
2 6 21
3 8 18.7
How should the function be changed so that it works uniformly with filter and group_by without needing "do"?
Or do quosures have nothing to do with the question, and group_by simply requires "do" because the function in the example takes multiple arguments?
This is independent of quosures. Here's the same issue in the absence of non-standard evaluation in fun_head():
fun_head <- function(df) {df %>% select(mpg) %>% head(1)}
mtcars %>% group_by( cyl ) %>% fun_head()
# Adding missing grouping variables: `cyl`
# # A tibble: 1 x 2
# # Groups: cyl [1]
# cyl mpg
# <dbl> <dbl>
# 1 6 21
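To see where the single row in the output above comes from, it can help to look at the two steps separately. The snippet below is a small sketch, not part of the original function: the variable name grouped is only for illustration, and slice() is included purely as a contrast, being an example of a dplyr verb that is group-aware.

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# select() always keeps the grouping column, which is why dplyr reports
# "Adding missing grouping variables: `cyl`"
names(select(grouped, mpg))

# head() is not group-aware: it takes the first row of the whole table
nrow(head(grouped, 1))

# a group-aware verb such as slice() takes the first row of each group
nrow(slice(grouped, 1))
```

So nothing actually errors with group_by; head(1) simply ignores the grouping and returns one row overall.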
As explained in other questions here and here, do is the connector that allows you to apply arbitrary functions to each group. The reason dplyr verbs such as mutate and filter don't require do is that they handle grouped data frames internally as special cases (see, e.g., the implementation of mutate). If you want your own function to emulate this behavior, you would need to distinguish between grouped and ungrouped data frames:
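fun_head2 <- function( df )
{
  # is_grouped_df() is used here because, in recent dplyr versions,
  # groups() may return an empty list rather than NULL for ungrouped data
  if( is_grouped_df(df) )
    df %>% do( fun_head2(.) )        # recurse on each group's rows
  else
    df %>% select(mpg) %>% head(1)   # plain data frame: first row only
}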
mtcars %>% group_by(cyl) %>% fun_head2()
# # A tibble: 3 x 2
# # Groups: cyl [3]
# cyl mpg
# <dbl> <dbl>
# 1 4 22.8
# 2 6 21
# 3 8 18.7
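The same idea can be carried back to the two-argument version from the question. The following is a minimal sketch, not a definitive implementation: fun_head3 and take_first are hypothetical names, and it assumes a dplyr version that provides group_modify() (0.8.1 or later), which is used here in place of do.

```r
library(dplyr)

# Hypothetical helper: first row of one column, per group if df is grouped
fun_head3 <- function(df, column) {
  column <- enquo(column)   # capture the column once, as in the question

  # the actual work, written for a single ungrouped data frame
  take_first <- function(d) d %>% select(!!column) %>% head(1)

  if (is_grouped_df(df)) {
    # group_modify() calls the function on each group's rows (.x) and
    # adds the grouping columns back to the result
    group_modify(df, ~ take_first(.x))
  } else {
    take_first(df)
  }
}

by_group <- mtcars %>% group_by(cyl) %>% fun_head3(mpg)
filtered <- mtcars %>% filter(cyl == 4) %>% fun_head3(mpg)
```

With this, the same call works after either filter or group_by. Note that the per-group function sees the data without the grouping columns, so this sketch only suits cases where column is not itself a grouping variable.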
EDIT: I want to point out that another alternative to group_by + do is to use tidyr::nest + purrr::map instead. Going back to your original function definition that takes two arguments:
fhead <- function(.df, .var) { .df %>% select(!!ensym(.var)) %>% head(1) }
The following two chains are equivalent (up to an ordering of rows, since group_by sorts by the grouping variable and nest doesn't):
# Option 1: group_by + do
mtcars %>% group_by(cyl) %>% do( fhead(., mpg) ) %>% ungroup()
# Option 2: nest + map
# (written for tidyr >= 1.0; older versions used nest(-cyl) and a bare unnest)
mtcars %>% nest(data = -cyl) %>% mutate_at( "data", map, fhead, "mpg" ) %>% unnest(data)