R: Looping over custom dplyr function


I want to build a custom dplyr function and iterate over it ideally with purrr::map to stay in the tidyverse.

To keep things simple, I reproduce my problem with a very basic summarise function.

When building custom functions with dplyr, I ran into the problem of non-standard evaluation (NSE). I found three different ways to deal with it. Each works fine when the function is called directly, but not when I loop over it. Below is the code to reproduce my problem. What is the correct way to make my function work with purrr::map?

    # loading libraries
    library(dplyr)
    library(tidyr)
    library(purrr)

    # generate test data
    test_tbl <- rbind(tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
                             score = sample(0:10, size = 600, replace = TRUE)),

                      tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
                             score = sample(0:10, size = 150, replace = TRUE))
    )




    # generate two variables to loop over
    test_tbl$group2 <- test_tbl$group
    vars <- c("group", "group2")


    # summarise function 1 using enquo()
    sum_tbl1 <- function(df, x) {

        x <- dplyr::enquo(x)

        df %>%
            dplyr::group_by(!! x) %>%
            dplyr::summarise(score = mean(score, na.rm =TRUE),
                             n = dplyr::n())

    }

    # summarise function 2 using .dots = lazyeval
    sum_tbl2 <- function(df, x) {

        df %>%
            dplyr::group_by_(.dots = lazyeval::lazy(x)) %>%
            dplyr::summarize(score = mean(score, na.rm =TRUE),
                             n = dplyr::n())

    }

    # summarise function 3 using ensym()
    sum_tbl3 <- function(df, x) {

        df %>%
            dplyr::group_by(!!rlang::ensym(x)) %>%
            dplyr::summarize(score = mean(score, na.rm =TRUE),
                             n = dplyr::n())

    }


    # Looping over the functions with map
    # each variation produces an error no matter which function I choose

    # call within anonymous function without pipe
    map(vars, function(x) sum_tbl1(test_tbl, x))
    map(vars, function(x) sum_tbl2(test_tbl, x))
    map(vars, function(x) sum_tbl3(test_tbl, x))

    # call within anonymous function within pipe
    map(vars, function(x) test_tbl %>% sum_tbl1(x))
    map(vars, function(x) test_tbl %>% sum_tbl2(x))
    map(vars, function(x) test_tbl %>% sum_tbl3(x))

    # call with formula notation without pipe
    map(vars, ~sum_tbl1(test_tbl, .x))
    map(vars, ~sum_tbl2(test_tbl, .x))
    map(vars, ~sum_tbl3(test_tbl, .x))

    # call with formula notation within pipe
    map(vars,  ~test_tbl %>% sum_tbl1(.x))
    map(vars,  ~test_tbl %>% sum_tbl2(.x))
    map(vars,  ~test_tbl %>% sum_tbl3(.x))

I know that there are other ways to produce summary tables in a loop, such as defining an anonymous function directly inside map (see code below). However, the problem I am interested in is how to deal with NSE in loops in general.

    # One possibility to create summary tables in loops with map
    vars %>%
        map(function(x) {
            test_tbl %>%
                dplyr::group_by(!!rlang::ensym(x)) %>%
                dplyr::summarize(score = mean(score, na.rm = TRUE),
                                 n = dplyr::n())
        })

Update:

Below, akrun provides a solution that makes the call via purrr::map() possible. With that solution, however, a direct call only works when the grouping variable is passed as a string, either directly

    sum_tbl(test_tbl, "group")

or indirectly as

    sum_tbl(test_tbl, vars[1])

It is not possible to pass the grouping variable in the usual dplyr way as

    sum_tbl(test_tbl, group)

In the end, it seems to me that solutions to NSE in custom dplyr functions address the problem in one of two ways: either at the level of the function call itself, in which case using map/lapply is not possible, or in a way that works with iteration, in which case variables can only be passed as strings.
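
As far as I understand it, the reason is that enquo() records the expression written at the call site, not the value it refers to. A minimal sketch of this, assuming rlang is installed (the helper captured_expr() is a hypothetical name used only for illustration):

```r
library(rlang)

# Hypothetical helper: return the expression that enquo() captures for x
captured_expr <- function(x) {
    rlang::quo_get_expr(rlang::enquo(x))
}

v <- "group"

# A bare column name is captured as the symbol `group`
identical(captured_expr(group), quote(group))

# A variable holding the column name is captured as the symbol `v`,
# not as the string "group" stored in it
identical(captured_expr(v), quote(v))

# A literal string is captured as the string itself
identical(captured_expr("group"), "group")
```

All three comparisons return TRUE. Inside a loop the function only ever sees the name of the loop variable, which is why the enquo()-based version cannot find the intended column.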

Building on akrun's answer, I built a workaround function that accepts both strings and bare variable names in the function call. However, there are surely better ways to achieve this. Ideally, there would be a more straightforward way of dealing with NSE in custom dplyr functions, so that a workaround like the one below is not necessary in the first place.

    sum_tbl <- function(df, x) {

        # capture x once and look up the environment attached to the quosure
        x_var <- dplyr::enquo(x)

        x_env <- rlang::get_env(x_var)

        if (identical(x_env, rlang::empty_env())) {

            # works, when x is a string and in loops via map/lapply
            sum_tbl <- df %>%
                dplyr::group_by(!! rlang::sym(x)) %>%
                dplyr::summarise(score = mean(score, na.rm = TRUE),
                                 n = dplyr::n())

        } else {

            # works, when x is a normal variable name without quotation marks
            sum_tbl <- df %>%
                dplyr::group_by(!! x_var) %>%
                dplyr::summarise(score = mean(score, na.rm = TRUE),
                                 n = dplyr::n())
        }

        return(sum_tbl)
    }

Final update/solution

In an updated version of his answer akrun provides a solution which accounts for four ways of calling variable x:

  1. as a normal (non-string) variable name: sum_tbl(test_tbl, group)
  2. as a string name: sum_tbl(test_tbl, "group")
  3. as an indexed vector: sum_tbl(test_tbl, !!vars[1])
  4. and as a vector within purrr::map(): map(vars, ~ sum_tbl(test_tbl, !!.x))

In (3) and (4) it is necessary to unquote the variable x using !!.
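
The effect of !! can be sketched as follows, assuming rlang and purrr are installed. The helper captured_name() is a hypothetical name, not part of akrun's answer. With !!, the value of the variable is inserted into the call before enquo() captures it; without !!, enquo() only sees the variable's name.

```r
library(rlang)
library(purrr)

# Hypothetical helper: capture x and return what was captured as a string
captured_name <- function(x) {
    rlang::as_name(rlang::enquo(x))
}

vars <- c("group", "group2")

captured_name(group)        # "group": the bare name itself
captured_name(!!vars[1])    # "group": the value of vars[1] is injected first

# Without !!, every iteration captures the lambda's placeholder name
without_bang <- map_chr(vars, ~ captured_name(.x))    # ".x" ".x"

# With !!, every iteration captures the current element of vars
with_bang <- map_chr(vars, ~ captured_name(!!.x))     # "group" "group2"
```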

If I were the only one using the function, this wouldn’t be a problem, but as soon as other team members use it I would need to explain and document this behaviour.

To avoid this, I have now extended akrun’s solution to account for all four ways without unquoting. However, I am not sure whether this solution creates other pitfalls.

    sum_tbl <- function(df, x) {

        # if x is a bare symbol such as group (no quotation marks),
        # then turn it into a string
        if (is.symbol(rlang::get_expr(enquo(x)))) {

            x <- quo_name(enquo(x))

        # if x is a language object such as vars[1], evaluate it
        # to obtain its value, then turn that into a string
        } else if (is.language(rlang::get_expr(enquo(x)))) {

            x <- eval(x)
            x <- quo_name(enquo(x))

        }

        # this part of the function works with normal strings as x
        sum_tbl <- df %>%
            dplyr::group_by(!! rlang::sym(x)) %>%
            dplyr::summarise(score = mean(score, na.rm = TRUE),
                             n = dplyr::n())

        return(sum_tbl)

    }

Solution

  • We can use group_by_at, which accepts column names as strings

    sum_tbl1 <- function(df, x) {

        df %>%
            dplyr::group_by_at(x) %>%
            dplyr::summarise(score = mean(score, na.rm = TRUE),
                             n = dplyr::n())

    }
    

    and then call as

    out1 <- map(vars, ~ sum_tbl1(test_tbl, .x))
    

    Another option is to convert the string to a symbol and then unquote it (!!) within group_by

    sum_tbl2 <- function(df, x) {

        df %>%
            dplyr::group_by(!! rlang::sym(x)) %>%
            dplyr::summarise(score = mean(score, na.rm = TRUE),
                             n = dplyr::n())

    }
    
    out2 <- map(vars, ~ sum_tbl2(test_tbl, .x))
    
    identical(out1 , out2)
    #[1] TRUE
    

    If we pass df by name, map supplies each element of vars as the remaining argument x, so the function can also be run without an anonymous function

    map(vars, sum_tbl2, df = test_tbl)
    

    Update

    If we want the function to handle the cases mentioned in the OP's updated post

    sum_tbl3 <- function(df, x) {

        x1 <- enquo(x)
        x2 <- quo_name(x1)

        df %>%
            dplyr::group_by_at(x2) %>%
            dplyr::summarise(score = mean(score, na.rm = TRUE),
                             n = dplyr::n())

    }
    
    
    sum_tbl3(test_tbl, group)
    # A tibble: 7 x 3
    #  group score     n
    #  <chr> <dbl> <int>
    #1 a      5.43   148
    #2 b      5.01   144
    #3 c      5.35   156
    #4 d      5.19   152
    #5 e      5.65    72
    #6 f      5.31    36
    #7 g      5.24    42
    
    sum_tbl3(test_tbl, "group")
    # A tibble: 7 x 3
    #  group score     n
    #  <chr> <dbl> <int>
    #1 a      5.43   148
    #2 b      5.01   144
    #3 c      5.35   156
    #4 d      5.19   152
    #5 e      5.65    72
    #6 f      5.31    36
    #7 g      5.24    42
    

    or call from 'vars'

    sum_tbl3(test_tbl, !!vars[1])
    # A tibble: 7 x 3
    #  group score     n
    #  <chr> <dbl> <int>
    #1 a      5.43   148
    #2 b      5.01   144
    #3 c      5.35   156
    #4 d      5.19   152
    #5 e      5.65    72
    #6 f      5.31    36
    #7 g      5.24    42
    

    and with map

    map(vars, ~ sum_tbl3(test_tbl, !!.x))
    #[[1]]
    # A tibble: 7 x 3
    #  group score     n
    #  <chr> <dbl> <int>
    #1 a      5.43   148
    #2 b      5.01   144
    #3 c      5.35   156
    #4 d      5.19   152
    #5 e      5.65    72
    #6 f      5.31    36
    #7 g      5.24    42
    
    #[[2]]
    # A tibble: 7 x 3
    #  group2 score     n
    #  <chr>  <dbl> <int>
    #1 a       5.43   148
    #2 b       5.01   144
    #3 c       5.35   156
    #4 d       5.19   152
    #5 e       5.65    72
    #6 f       5.31    36
    #7 g       5.24    42
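
As a side note, group_by_at is superseded in dplyr 1.0.0 and later in favour of across. A minimal sketch of the string-based variant under that assumption (dplyr >= 1.0.0 installed); toy_tbl and sum_tbl_across are hypothetical names and the data are made up for illustration:

```r
library(dplyr)
library(purrr)

# Small stand-in data; column names mirror test_tbl, values are invented
toy_tbl <- tibble(group  = c("a", "a", "b"),
                  group2 = c("x", "y", "y"),
                  score  = c(1, 3, 5))

# Hypothetical variant of sum_tbl1: x is a character vector of column names
sum_tbl_across <- function(df, x) {
    df %>%
        dplyr::group_by(dplyr::across(dplyr::all_of(x))) %>%
        dplyr::summarise(score = mean(score, na.rm = TRUE),
                         n = dplyr::n())
}

# Strings can be passed directly, so no unquoting is needed inside map
out <- map(c("group", "group2"), ~ sum_tbl_across(toy_tbl, .x))
```

Like the group_by_at version, this only accepts column names as strings, so a bare name such as sum_tbl_across(toy_tbl, group) is not supported.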