r, dplyr, tidyverse, quosure

R: why group_by still requires "do" even when using quosures


How can I make a user-defined function work nicely with pipes and group_by? Here is a simple function:

 library(tidyverse)

 fun_head <- function(df, column) {
   column <- enquo(column)
   df %>% select(!!column) %>% head(1)
 }

The function works nicely with pipes and can be combined with a filter on another column:

 mtcars %>% filter(cyl == 4) %>% fun_head(mpg)

 #    mpg
 # 1 22.8

However, the same pipeline does not give a per-group result with group_by; it returns only the first row overall:

mtcars %>% group_by(cyl) %>% fun_head(mpg)

Adding missing grouping variables: `cyl`
# A tibble: 1 x 2
# Groups:   cyl [1]
     cyl   mpg
     <dbl> <dbl>
1     6    21

Using "do" after group_by makes it work:

 > mtcars %>% group_by(cyl) %>% do(fun_head(., mpg))
 # A tibble: 3 x 2
 # Groups:   cyl [3]
    cyl   mpg
   <dbl> <dbl>
1     4  22.8
2     6  21  
3     8  18.7

How should the function be changed so that it works uniformly with filter and group_by without needing "do"?
Or do quosures have nothing to do with the question, and group_by simply requires "do" because the function in the example has multiple arguments?


Solution

  • This is independent of quosures. Here's the same issue in the absence of non-standard evaluation in fun_head():

    fun_head <- function(df) {df %>% select(mpg) %>% head(1)}
    mtcars %>% group_by( cyl ) %>% fun_head()
    # Adding missing grouping variables: `cyl`
    # # A tibble: 1 x 2
    # # Groups:   cyl [1]
    #     cyl   mpg
    #   <dbl> <dbl>
    # 1     6    21
    

    As explained in other questions, do is the connector that lets you apply an arbitrary function to each group. dplyr verbs such as mutate and filter don't require do because they handle grouped data frames internally as special cases (see, e.g., the implementation of mutate). If you want your own function to emulate this behavior, it has to distinguish between grouped and ungrouped data frames:

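    fun_head2 <- function( df )
    {
      # Test for grouping explicitly: depending on the dplyr version, groups()
      # can return an empty list instead of NULL for ungrouped data
      if( is_grouped_df(df) )
        df %>% do( fun_head2(.) )   # each group arrives here ungrouped
      else
        df %>% select(mpg) %>% head(1)
    }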
    
    mtcars %>% group_by(cyl) %>% fun_head2()
    # # A tibble: 3 x 2
    # # Groups:   cyl [3]
    #     cyl   mpg
    #   <dbl> <dbl>
    # 1     4  22.8
    # 2     6  21  
    # 3     8  18.7
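
    The same check can be combined with the two-argument, quosure-based signature from the question. The following is a minimal sketch, not a definitive implementation: the name fun_head3 is made up for illustration, and it assumes a dplyr version that exports is_grouped_df() and whose do() supports !! unquoting.

```r
library(dplyr)

# Hypothetical name: fun_head3 is not part of the original question or answer
fun_head3 <- function(df, column) {
  column <- enquo(column)                 # capture the column as a quosure
  if (is_grouped_df(df)) {
    # Grouped input: re-apply this function to each (ungrouped) group,
    # forwarding the captured column with !!
    df %>% do(fun_head3(., !!column))
  } else {
    # Ungrouped input: same body as the original fun_head()
    df %>% select(!!column) %>% head(1)
  }
}

by_group <- mtcars %>% group_by(cyl) %>% fun_head3(mpg)
filtered <- mtcars %>% filter(cyl == 4) %>% fun_head3(mpg)
```

    With this definition, by_group should hold one row per value of cyl, while filtered holds the single first row of the filtered data, so the same call works after either filter or group_by.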
    

    EDIT: Another alternative to group_by + do is tidyr::nest + purrr::map. Going back to your original function definition, which takes two arguments:

    fhead <- function(.df, .var) { .df %>% select(!!ensym(.var)) %>% head(1) }
    

    The following two chains are equivalent (up to row order, since group_by sorts by the grouping variable, whereas nest keeps groups in order of first appearance):

    # Option 1: group_by + do
    mtcars %>% group_by(cyl) %>% do( fhead(., mpg) ) %>% ungroup
    
    # Option 2: nest + map
    # (tidyr >= 1.0 spelling; older versions used nest(-cyl) and a bare unnest)
    mtcars %>% nest(data = -cyl) %>% mutate( data = map(data, fhead, "mpg") ) %>% unnest(data)
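
    The equivalence claim can be checked by sorting both results on the grouping variable before comparing them. This is a rough sketch that assumes the tidyr >= 1.0 interface (nest(data = -cyl), unnest(data)); the variable names opt1, opt2 and same_rows are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

fhead <- function(.df, .var) { .df %>% select(!!ensym(.var)) %>% head(1) }

# Option 1: group_by + do
opt1 <- mtcars %>% group_by(cyl) %>% do(fhead(., mpg)) %>% ungroup()

# Option 2: nest + map (assumes tidyr >= 1.0 argument names)
opt2 <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(data = map(data, fhead, "mpg")) %>%
  unnest(data)

# Remove the row-order difference, then compare column by column
opt1_sorted <- opt1 %>% arrange(cyl)
opt2_sorted <- opt2 %>% arrange(cyl)
same_rows <- identical(opt1_sorted$cyl, opt2_sorted$cyl) &&
  identical(opt1_sorted$mpg, opt2_sorted$mpg)
```

    If the two chains really are equivalent, same_rows is TRUE once both results are sorted by cyl.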