r dplyr rlang tidyeval quosure

dplyr group_by multiple function arguments via Non Standard Evaluation


I was reading dplyr's vignette to figure out how to use dplyr inside my own functions. Midway through, it explains how to use enquos on ... in order to pass multiple arguments to group_by.

A short example of how it would work:

grp <- rlang::enquos(...)
df %>%
    group_by(!!!grp)
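
The fragment above only works inside a function that has ... in its signature. A minimal self-contained sketch of that pattern, assuming dplyr 1.0.0 or later is attached; the helper name count_by and the use of the built-in mtcars data are illustrative and do not come from the vignette:

```r
library(dplyr)

# Hypothetical helper (not from the vignette): group by every bare column
# name supplied through `...` and count the rows in each group.
count_by <- function(df, ...) {
  grp <- rlang::enquos(...)        # one quosure per argument in `...`
  df %>%
    group_by(!!!grp) %>%           # splice the quosures into group_by()
    summarise(n = n(), .groups = "drop")
}

res <- count_by(mtcars, cyl, am)   # groups by cyl and am
```

The limitation is that ... is then reserved for the grouping columns, so it cannot also carry other sets of expressions.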

I wanted to know whether there is a way to pass multiple expressions through a single named argument, without reserving ... for that purpose and without resorting to questionable coding.

To get an idea of what the call would look like, use the following example:

# reproducible data
df <- datasets::USJudgeRatings
df$name <- rownames(df)
df <- tidyr::gather(df, key = "key", value = "value", -name)
df$dummy <- c("1","2")


test_summarize <- function(df, sum.col, grp = NULL, filter = NULL) {
  filter <- rlang::enquo(filter)
  sum.col <- rlang::enquo(sum.col)
  if(!is.null(rlang::get_expr(filter))){
    df <- dplyr::filter(df, !!filter)
  }

  #how grp is turned into a character vector to be passed to .dots in group_by
  grp <- substitute(grp)
  if(!is.null(grp)){
    grp <- deparse(grp)
    grp <- strsplit(gsub(pattern = "list\\(|c\\(|\\)|", replacement = "", x = grp), split =",")[[1]]
    grp <- gsub(pattern = "^ | $", replacement = "", x = grp)
    df %>%
      dplyr::group_by(.dots=grp) %>%
      dplyr::summarise(mean = mean(!!sum.col), sum = sum(!!sum.col), n = n())
  } else{
    df %>%
      dplyr::summarise(mean = mean(!!sum.col), sum = sum(!!sum.col), n = n())
  }

}

test_summarize(df, sum.col=value, grp = c(name, dummy))

# A tibble: 86 x 5
# Groups:   name [?]
   name           dummy  mean   sum     n
   <chr>          <fct> <dbl> <dbl> <int>
 1 AARONSON,L.H.  1      7.17  43       6
 2 AARONSON,L.H.  2      7.42  44.5     6
 3 ALEXANDER,J.M. 1      8.35  50.1     6
 4 ALEXANDER,J.M. 2      7.95  47.7     6
 5 ARMENTANO,A.J. 1      7.53  45.2     6
 6 ARMENTANO,A.J. 2      7.7   46.2     6
 7 BERDON,R.I.    1      8.67  52       6
 8 BERDON,R.I.    2      8.25  49.5     6
 9 BRACKEN,J.J.   1      5.65  33.9     6
10 BRACKEN,J.J.   2      5.82  34.9     6
# ... with 76 more rows

This works for what I was trying to do, but I was wondering whether there is a better way to accept and handle the arguments. Every attempt I made to turn the original grp call into something resembling what enquos(...) returns failed, so I deparsed the expression and turned it into a character vector. Honestly, should I just expect the user to pass a character vector?

I am opting not to use a character vector as the expected input because I want to remain consistent: the sum.col and filter arguments of the function expect NSE expressions. Maybe there is something in the rlang package that will convert each element of the original expression into a list of quosures?

Edit: fixed reproducible example and provided expected output


Solution

  • If we use group_by_at and pass grp as a character vector, we may not need the if/else branch

    library(dplyr)

    # filter is kept in the signature but not used in this version
    test_summarize <- function(df, sum.col, grp = NULL, filter = NULL) {
      df %>%
        group_by_at(grp) %>%
        summarise(mean = mean({{sum.col}}),
                  sum = sum({{sum.col}}), n = n())
    }
    
    
    test_summarize(df, sum.col=value, grp = c("name", "dummy"))
    # A tibble: 86 x 5
    # Groups:   name [43]
    #   name           dummy  mean   sum     n
    #   <chr>          <chr> <dbl> <dbl> <int>
    # 1 AARONSON,L.H.  1      7.17  43       6
    # 2 AARONSON,L.H.  2      7.42  44.5     6
    # 3 ALEXANDER,J.M. 1      8.35  50.1     6
    # 4 ALEXANDER,J.M. 2      7.95  47.7     6
    # 5 ARMENTANO,A.J. 1      7.53  45.2     6
    # 6 ARMENTANO,A.J. 2      7.7   46.2     6
    # 7 BERDON,R.I.    1      8.67  52       6
    # 8 BERDON,R.I.    2      8.25  49.5     6
    # 9 BRACKEN,J.J.   1      5.65  33.9     6
    #10 BRACKEN,J.J.   2      5.82  34.9     6
    # … with 76 more rows
    
    
    
    test_summarize(df, sum.col=value)
    # A tibble: 1 x 3
    #   mean   sum     n
    #  <dbl> <dbl> <int>
    #1  7.57 3908.   516
    

    which is the same as

    df %>%
       summarise(mean = mean(value), sum = sum(value), n = n())
    #     mean    sum   n
    #1 7.57345 3907.9 516
    

    To filter as well, one option is to use ... and pass as many filter conditions as needed

    test_summarize <- function(df, sum.col, grp = NULL, ...) {
      df %>%
        filter(!!! rlang::enexprs(...)) %>%
        group_by_at(grp) %>%
        summarise(mean = mean({{sum.col}}), sum = sum({{sum.col}}), n = n())
    }
    
    
    test_summarize(df, sum.col=value, grp = c("name", "dummy"),
            key %in% c("CONT", "INTG"), value > 6.5)
    # A tibble: 77 x 5
    # Groups:   name [43]
    #   name           dummy  mean   sum     n
    #   <chr>          <chr> <dbl> <dbl> <int>
    # 1 AARONSON,L.H.  2       7.9   7.9     1
    # 2 ALEXANDER,J.M. 1       8.9   8.9     1
    # 3 ALEXANDER,J.M. 2       6.8   6.8     1
    # 4 ARMENTANO,A.J. 1       7.2   7.2     1
    # 5 ARMENTANO,A.J. 2       8.1   8.1     1
    # 6 BERDON,R.I.    1       8.8   8.8     1
    # 7 BERDON,R.I.    2       6.8   6.8     1
    # 8 BRACKEN,J.J.   1       7.3   7.3     1
    # 9 BURNS,E.B.     1       8.8   8.8     1
    #10 CALLAHAN,R.J.  1      10.6  10.6     1
    # … with 67 more rows
    

    and this will also evaluate when there are no filter arguments

    test_summarize(df, sum.col=value, grp = c("name", "dummy"))
    # A tibble: 86 x 5
    # Groups:   name [43]
    #   name           dummy  mean   sum     n
    #   <chr>          <chr> <dbl> <dbl> <int>
    # 1 AARONSON,L.H.  1      7.17  43       6
    # 2 AARONSON,L.H.  2      7.42  44.5     6
    # 3 ALEXANDER,J.M. 1      8.35  50.1     6
    # 4 ALEXANDER,J.M. 2      7.95  47.7     6
    # 5 ARMENTANO,A.J. 1      7.53  45.2     6
    # 6 ARMENTANO,A.J. 2      7.7   46.2     6
    # 7 BERDON,R.I.    1      8.67  52       6
    # 8 BERDON,R.I.    2      8.25  49.5     6
    # 9 BRACKEN,J.J.   1      5.65  33.9     6
    #10 BRACKEN,J.J.   2      5.82  34.9     6
    # … with 76 more rows
    

    which is the same as the first output
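
If grp should stay an unquoted expression such as c(name, dummy), as in the question, another option may be to embrace it inside across(), which accepts tidyselect syntax. The sketch below assumes dplyr 1.0.0 or later (where group_by_at is superseded by across(); newer releases also offer pick() for the same purpose) and tidyr for the example data. The name test_summarize_nse and the .groups = "drop" choice are illustrative and not part of the code above.

```r
library(dplyr)

# Reproducible data from the question
df <- datasets::USJudgeRatings
df$name <- rownames(df)
df <- tidyr::gather(df, key = "key", value = "value", -name)
df$dummy <- c("1", "2")

# Hypothetical variant: grp is a bare tidyselect expression, e.g. c(name, dummy)
test_summarize_nse <- function(df, sum.col, grp = NULL) {
  # Group only when grp was supplied; the default NULL leaves df ungrouped
  if (!rlang::quo_is_null(rlang::enquo(grp))) {
    df <- group_by(df, across({{ grp }}))
  }
  summarise(df,
            mean = mean({{ sum.col }}),
            sum = sum({{ sum.col }}),
            n = n(),
            .groups = "drop")  # assumption: return an ungrouped result
}

grouped <- test_summarize_nse(df, sum.col = value, grp = c(name, dummy))
overall <- test_summarize_nse(df, sum.col = value)
```

Unlike the group_by_at version, this sketch drops the grouping from the result, so the output has no "# Groups:" header; changing .groups to "drop_last" should restore grouping by name.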