Tags: r, dplyr, tidyeval

Tidyeval embrace doesn't work with default value


I have a function, perc_diff, that I use within dplyr's mutate. By default it calculates the relative difference from the first value in each group, but it also works with mean, max, nth, or any other function that returns a single value to compare the others to.

perc_diff <- function(num, fun = first, ...) {
    (num - fun(num, ...)) / fun(num, ...) * 100
}
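
As a quick sanity check of what the function computes, here is a small sketch on a plain vector, outside of mutate. The definition is repeated so the snippet runs on its own; first_value is a hypothetical base R stand-in for dplyr's first, and the input numbers are made up for illustration.

```r
# Same definition as above, repeated so the snippet is self-contained.
perc_diff <- function(num, fun = first, ...) {
    (num - fun(num, ...)) / fun(num, ...) * 100
}

# Hypothetical base R stand-in for dplyr::first(), so no packages are needed.
first_value <- function(x, ...) x[1]

x <- c(10, 15, 20)
perc_diff(x, first_value)  # 0 50 100
perc_diff(x, max)          # -50 -25 0
```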

Sometimes I need more control over which group to compare to. In that case I sort the data frame so that the rows matching a pattern come first within each group, and then use first.

library(dplyr)
library(stringr)
test_data <- data.frame(group = paste0("group_", rep(LETTERS[1:3], 3)), value = 1:9, other = rep(1:3, each = 3)) %>%
    arrange(rnorm(9))  # shuffle the rows

test_data %>%
    group_by(other) %>%
    arrange(other, desc(str_detect(group, "A$"))) %>%
    mutate(pdiff = perc_diff(value))

I wanted to skip the arranging step by building it into the function, and also have it return NA if it cannot find the control group. I wrote a get_control_value function that perc_diff can use instead of first. I used the embrace operator ({{ }}) from programming with dplyr to refer to the test group column.

get_control_value <- function(value, test_group_column = test_group, control_group_pattern = "A$") {
    test_vector <- stringr::str_detect({{test_group_column}}, control_group_pattern)
    if (sum(test_vector) == 1) {
        value[test_vector]
    } else {
        NA
    }
}

It works well if I pass test_group_column explicitly.

test_data %>%
    group_by(other) %>%
    mutate(pdiff = perc_diff(value, get_control_value, test_group_column = group)) %>%
    arrange(other, group)

But it doesn't work with the default value.

test_data %>%
    rename(test_group = group) %>%  # rename() takes new_name = old_name
    group_by(other) %>%
    mutate(pdiff = perc_diff(value, get_control_value)) %>%
    arrange(other, test_group)

My question is: why does it not work with the default value? I'm guessing it has something to do with str_detect not being a proper quasiquotation context. But then why does it work if I supply the column manually? Is it because I do so within mutate?

I know there are many ways to work around this, the simplest being to drop the default value and always pass the column explicitly. But I would still like to know whether there is some way to specify the default so that it works too.


Solution

  • Consider what would happen if you called just

    perc_diff(5, get_control_value)
    

    What would the default value be? There is no mutate(), so there is no column named test_group. As written, perc_diff doesn't know that it is meant to run inside mutate(). It is not aware of the "data context", so get_control_value has nowhere to look up the group values. Since str_detect doesn't understand quasiquotation, passing {{test_group}} is the same as passing test_group: the braces do nothing, just as {{5}} is the same as 5 outside the rlang syntax. You could remove the braces and it would behave the same.
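
    The point about the braces can be checked in plain R. The following is a minimal sketch, not the original code: detect is a hypothetical stand-in for stringr::str_detect built on base grepl, chosen only so the snippet needs no packages. Like str_detect, it evaluates its arguments normally, so {{ }} is parsed as two nested pairs of ordinary braces.

    ```r
    # Hypothetical stand-in for stringr::str_detect(): a regular function
    # that evaluates its arguments in the standard way.
    detect <- function(x, pattern) grepl(pattern, x)

    with_braces    <- function(col) detect({{ col }}, "A$")
    without_braces <- function(col) detect(col, "A$")

    groups <- c("group_A", "group_B", "group_C")

    # Both versions receive the values of `col`, so the results are the same.
    identical(with_braces(groups), without_braces(groups))  # TRUE
    identical({{ 5 }}, 5)                                   # TRUE
    ```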

    When you call

    perc_diff(value, get_control_value, test_group_column = group)
    

    Here you are not passing the name of the column; you are passing the values of the column (again, because {{ }} does nothing for str_detect). The expression group is written inside mutate(), so it is evaluated there, where the column exists. A default value, by contrast, is evaluated inside the function itself. When you call functions in R, variables are looked up according to lexical scoping: values come from where a function is defined, not from where it is called. So every value that a function used inside mutate() needs must be passed in. The called function doesn't have access to the data frame because the data frame is not in its lexical scope chain.
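
    The scoping rule can be seen in a minimal sketch that involves no data frame at all. The names use_default and caller are hypothetical and exist only for this illustration; the strings are placeholders.

    ```r
    test_group <- "top level"

    # The default for `col` is evaluated inside the function, whose enclosing
    # environment is where the function was defined (here, the top level).
    use_default <- function(col = test_group) col

    caller <- function() {
        # Local to caller(); the default argument above never sees it.
        test_group <- "caller"
        c(default = use_default(), explicit = use_default(test_group))
    }

    caller()
    # default is "top level", explicit is "caller"
    ```

    An explicitly supplied argument is evaluated where the call is written, which is why passing the column by hand inside mutate() works while the default does not.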

    Because of the way functions are nested, it's not really easy to walk up the call stack to find where the data may be coming from. So the rule is, if your function needs values from your data frame, you need to pass them in as a parameter.
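
    One way to follow that rule is to make the group column an ordinary required argument. This is only a sketch under assumptions: perc_diff_control is a hypothetical helper that is not part of the original code, and it uses base grepl in place of stringr::str_detect so that it runs without any packages.

    ```r
    # Hypothetical helper: both the values and the group labels are passed in.
    perc_diff_control <- function(value, group, control_group_pattern = "A$") {
        is_control <- grepl(control_group_pattern, group)
        # Use the control value only if exactly one row matches; otherwise NA.
        control <- if (sum(is_control) == 1) value[is_control] else NA_real_
        (value - control) / control * 100
    }

    perc_diff_control(c(1, 2, 3), c("group_A", "group_B", "group_C"))  # 0 100 200
    perc_diff_control(c(1, 2, 3), c("group_B", "group_C", "group_D"))  # NA NA NA
    ```

    Inside a grouped mutate() this would be called as perc_diff_control(value, group), so the column reaches the function as an argument rather than through a default.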

    But in this particular case, you could technically do

    get_control_value <- function(value, test_group_column = eval.parent(quote(test_group), 2), control_group_pattern = "A$") {
      test_vector <- stringr::str_detect(test_group_column, control_group_pattern)
      if (sum(test_vector) == 1) {
        value[test_vector]
      } else {
        NA
      }
    }
    

    which would go up the call stack, but this is really a hack. The nesting of function calls isn't necessarily guaranteed and it prevents you from calling the function in any other context.