Tags: r, dplyr, tidyeval

Passing (function) user-specified column name to dplyr do()


Original question

Can anyone explain to me why unquoting does not work in the following?

I want to pass a user-specified column name from a function through to a call to do() in version 0.7.4 of dplyr. This seems somewhat less awkward than the older standard-evaluation approach using do_(). A basic (successful) example, ignoring the fact that using do() here is quite unnecessary, would be something like:

library(dplyr)

sum_with_do <- function(D, x, ...) {
    x <- rlang::ensym(x)
    gr <- quos(...)

    D %>%
        group_by(!!! gr) %>%
        do(data.frame(y=sum(.[[quo_name(x)]])))
}

D <- data.frame(group=c('A','A','B'), response=c(1,2,3))
sum_with_do(D, response, group)

# A tibble: 2 x 2
# Groups:   group [2]
  group     y
  <fct> <dbl>
1 A        3.
2 B        3.

The rlang:: prefix is unnecessary as of dplyr 0.7.5, which exports ensym. I have included lionel's suggestion to use ensym here rather than enquo, as the former guarantees that the value of x is a symbol (not an expression).

Unquoting does not work here the way it does in other dplyr examples: replacing quo_name(x) with !! x in the above produces the following error:

Error in ~response : object 'response' not found
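
For contrast, here is a minimal sketch of the same function written with summarise(), a data-masking verb in which unquoting the symbol does work. The name sum_with_summarise is hypothetical and not from the original post; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical counterpart to sum_with_do(), using summarise() instead of do().
sum_with_summarise <- function(D, x, ...) {
    x <- rlang::ensym(x)   # capture the bare column name as a symbol
    gr <- quos(...)        # capture the grouping columns

    D %>%
        group_by(!!! gr) %>%
        # summarise() evaluates its arguments with the data as a mask,
        # so the unquoted symbol resolves to the column
        summarise(y = sum(!! x))
}

D <- data.frame(group = c('A', 'A', 'B'), response = c(1, 2, 3))
out <- sum_with_summarise(D, response, group)
```

Here out has one row per group with the summed response in column y, matching the do() version above.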

Explanation

As per the accepted answer, the underlying reason is that do does not evaluate the expression in the same environment that other dplyr functions (e.g. mutate) use.

I did not find this abundantly clear from either the documentation or the source code (compare, for example, the source of mutate and do for data.frames, and follow Alice down the rabbit hole if you wish). Essentially, and this is probably nothing new to most:

  • do evaluates expressions in an environment whose parent is the calling environment, with the symbol . bound to the current group (slice) of the data.frame;
  • other dplyr functions 'more or less' evaluate expressions in the environment of the data.frame, with the calling environment as its parent.
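
The two evaluation modes described above can be imitated in base R. This is only a rough sketch of the idea, not dplyr's actual implementation; eval_like_do and eval_like_mutate are hypothetical helper names introduced for illustration.

```r
# Mimics do(): a fresh environment whose parent is the caller,
# containing nothing but `.` bound to the data.
eval_like_do <- function(expr, data, env = parent.frame()) {
    e <- new.env(parent = env)
    assign(".", data, envir = e)
    eval(expr, e)
}

# Mimics mutate()/summarise(): the data frame's columns are looked up
# first (the "data mask"), then the calling environment.
eval_like_mutate <- function(expr, data, env = parent.frame()) {
    eval(expr, data, env)
}

D <- data.frame(group = c("A", "A", "B"), response = c(1, 2, 3))

masked <- eval_like_mutate(quote(sum(response)), D)   # column found via the mask
dotted <- eval_like_do(quote(sum(.$response)), D)     # column reached through `.`

# Without a mask, a bare `response` is just an undefined variable
# (assuming no object of that name exists in the calling environment).
bare <- tryCatch(
    eval_like_do(quote(sum(response)), D),
    error = function(e) conditionMessage(e)
)
```

Both masked and dotted evaluate to 6, while bare holds the "object not found" error message, which mirrors the error in the question.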

See also Advanced R, 22. Evaluation, for a description in terms of 'data masking'.


Solution

  • This is because of the regular do() semantics, where there is no data masking apart from . (below, df is the ungrouped example data frame from the question):

    do(df, data.frame(y = sum(.$response)))
    #>   y
    #> 1 6
    
    do(df, data.frame(y = sum(.[[response]])))
    #> Error: object 'response' not found
    

    So you just need to capture the bare column name as a string; there is no need to unquote, since there is no data masking:

    sum_with_do <- function(df, x, ...) {
      # ensym() guarantees that `x` is a simple column name and not a
      # complex expression:
      x <- as.character(ensym(x))
    
      df %>%
        group_by(...) %>%
        do(data.frame(y = sum(.[[x]])))
    }
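
    A small sketch of what the as.character(ensym(x)) step does on its own, assuming rlang is installed. The helper name col_name is hypothetical and only used for illustration:

```r
# Hypothetical helper isolating the capture step from sum_with_do().
col_name <- function(x) as.character(rlang::ensym(x))

from_bare   <- col_name(response)     # bare column name, never evaluated
from_string <- col_name("response")   # ensym() also accepts a string

# A complex expression is rejected instead of being silently accepted.
rejected <- tryCatch(
    { col_name(log(response)); FALSE },
    error = function(e) TRUE
)
```

    Both from_bare and from_string are the string "response", and rejected is TRUE, so the function body can safely index with .[[x]].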