
mutate(across()) with external function that references other variables in current data frame without passing second argument


I want to mutate() multiple variables with across():

  1. using a function defined beforehand
  2. that references other variables in the data frame but
  3. needs only one argument (the variable to be mutated) and
  4. does not hard code the environment for those variables inside the function.

For example, this code will add the variable x to each of y and z:

library(dplyr)

# Data to modify
dtmp = tibble(x = 1:4, y = 10, z = 20)

# Function to pass to mutate(across())
addx = function(col, added){col + added}

# Any of these work (note: passing added=x through the ... of across() is deprecated as of dplyr 1.1.0)
dtmp %>% mutate(across(c(y,z), addx, added=x))
dtmp %>% mutate(across(c(y,z), ~addx(.x, x)))
dtmp %>% mutate(across(c(y,z), function(var){addx(var, x)}))

It is possible to avoid passing a second argument to addx inside mutate(across()) by hard-coding the reference to dtmp$x in the global environment:

addx = function(col){col + dtmp$x}
dtmp %>% mutate(across(c(y,z), addx))

However, this solution is risky. For example, it breaks if the data frame is grouped (by some fourth variable) before the mutate() call: dtmp$x always has the full length of the data frame, whereas y and z are passed to the function one group at a time, so the lengths no longer match.
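To make the failure concrete, here is a minimal sketch. The grouping column g and the name addx_global are hypothetical, added only for this illustration. With two groups of two rows each, the function receives 2 values but adds 4, and mutate() rejects the result:

```r
library(dplyr)

# Hypothetical grouping column g, added only for this illustration
dtmp <- tibble(g = c("a", "a", "b", "b"), x = 1:4, y = 10, z = 20)

# Hard-coded reference to the full-length column in the global environment
addx_global <- function(col) col + dtmp$x

# Within each group col has 2 elements but dtmp$x has 4, so the result
# has the wrong size for the group and mutate() raises an error
result <- tryCatch(
  dtmp %>% group_by(g) %>% mutate(across(c(y, z), addx_global)),
  error = function(e) "error"
)

print(result)
```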

It seems like it should be possible to write addx so that we neither have to pass a second argument to it inside mutate(across()) nor have to hard-code dtmp$x inside the function definition. Is this possible? In other words, is there some something(x) that makes the expression x inside the definition of addx() evaluate in the context of the current data frame (the one that mutate(across(...)) is operating on)?

The structure of a solution would look like

addx = function(col){col + something(x)}
dtmp %>% mutate(across(c(y,z), addx))

Example use case: some functions we might use to modify variables may reference many other variables inside the data frame and those functions may be used many times in the code. Writing out arg1=var1, arg2=var2, arg3=var3,... is a mess.


Solution

  • You can extract x from cur_data(), which returns the data for the current group, so this also works when the data frame is grouped.

    library(dplyr)
    
    dtmp = tibble(x = 1:4, y = 10, z = 20)
    
    # Function to pass to mutate(across())
    addx = function(col) {col + cur_data()$x}
    
    dtmp %>% mutate(across(c(y,z), addx))
    
    #      x     y     z
    #  <int> <dbl> <dbl>
    #1     1    11    21
    #2     2    12    22
    #3     3    13    23
    #4     4    14    24
    

    If the function needs to reference a grouping variable, use cur_data_all() instead: cur_data() leaves the grouping columns out, while cur_data_all() includes them.
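As a rough check of the grouped case, the sketch below groups by a hypothetical numeric column g (not part of the original data) and uses a hypothetical helper addg alongside addx. It sticks to the functions named above. Note that dplyr 1.1.0 and later deprecate cur_data() and cur_data_all() in favour of pick(), so newer versions may emit a deprecation warning here.

```r
library(dplyr)

# Hypothetical numeric grouping column g, added only for this illustration
dtmp <- tibble(g = c(1, 1, 2, 2), x = 1:4, y = 10, z = 20)

# References a non-grouping column of the current group
addx <- function(col) col + cur_data()$x

# References the grouping column, which only cur_data_all() exposes
addg <- function(col) col + cur_data_all()$g

by_x <- dtmp %>% group_by(g) %>% mutate(across(c(y, z), addx)) %>% ungroup()
by_g <- dtmp %>% group_by(g) %>% mutate(across(c(y, z), addg)) %>% ungroup()

by_x$y  # 11 12 13 14: each group adds only its own rows of x
by_g$y  # 11 11 12 12: each group adds its own value of g
```

Because both helpers look up the current group at call time, the same function definitions work with or without group_by().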