Tags: r, dplyr, rlang, tidyeval

Write a function that accesses data from the dplyr context


Disclaimer: this is a very elemental question. I'll use an example to make it easier, but the question has nothing to do with the example itself.

Suppose you have a dataframe df:

# A tibble: 5 × 4
  index     a     b     c
  <int> <int> <dbl> <dbl>
1     1     0     0     1
2     2     1     0     0
3     3     0     1     0
4     4     0     1     0
5     5     1     0     0

And you want to gather the dummies into a single factor column. Taking inspiration from eatATA::dummiesToFactor(), you could use something like:

dum2fac <- function(data) { factor(names(data)[max.col(data)]) }

df %>% mutate(name = dum2fac(across(a:c)))

# A tibble: 5 × 5
  index     a     b     c name 
  <int> <int> <dbl> <dbl> <fct>
1     1     0     0     1 c    
2     2     1     0     0 a    
3     3     0     1     0 b    
4     4     0     1     0 b    
5     5     1     0     0 a 

Now suppose you want to modify dum2fac() to allow for something like the following:

df %>% mutate(name = dum2fac(a:c))

I tried one specific path, and from that my "more elemental" question appeared. This was what I tried:

dum2fac <- function(expr) {
  data <- select(???, {{expr}})
  factor(names(data)[max.col(data)])
}

Here a:c is passed to expr, and ??? stands for "the dataset that is being used in the dplyr context". Put another way: across(a:c) doesn't refer directly to the dataset df; it just knows that it needs to access it because of the context in which it is used, and I want my function to be able to do the same.

Some concepts I figured could help were the "rlang fake data pronoun" .data, and some higher-order functions/objects that are used in across and mutate, like the R6 object DataMask, peek_mask(), and others that are probably not good practice to use, even if possible.

Note: I'd be glad to hear of a better way to rewrite dum2fac(), so please add it too. But again, that's not exactly what this question is about.

Dummy data:

set.seed(2023)
df <- tibble(index = 1:5,
             a = sample(0:1, 5, TRUE),
             b = (1 - a) * sample(0:1, 5, TRUE),
             c = 1 - a - b)

Solution

  • You can use across() or (more idiomatically) pick() inside your own function:

    library(dplyr)
    set.seed(2023)
    
    df <- tibble(
      index = 1:5,
      a = sample(0:1, 5, TRUE),
      b = (1 - a) * sample(0:1, 5, TRUE),
      c = 1 - a - b
    )
    
    dum2fac <- function(expr) {
      data <- pick({{ expr }})
      factor(names(data)[max.col(data)])
    }
    
    df %>% mutate(name = dum2fac(a:c))
    #> # A tibble: 5 × 5
    #>   index     a     b     c name 
    #>   <int> <int> <dbl> <dbl> <fct>
    #> 1     1     0     0     1 c    
    #> 2     2     1     0     0 a    
    #> 3     3     0     1     0 b    
    #> 4     4     0     1     0 b    
    #> 5     5     1     0     0 a
    

    If you want the full data without selections, use pick(everything()).
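As a rough illustration of the pick(everything()) variant, here is a minimal sketch. It assumes dplyr 1.1.0 or later (where pick() was introduced), and the names `dummies` and `dum2fac_all()` are hypothetical, made up for this example. The tibble holds only the dummy columns, with values copied from the table in the question, so that "everything" means exactly a, b and c.

```r
library(dplyr)

# Hypothetical data: only the dummy columns from the question's table
dummies <- tibble(
  a = c(0, 1, 0, 0, 1),
  b = c(0, 0, 1, 1, 0),
  c = c(1, 0, 0, 0, 0)
)

# Hypothetical helper: takes no arguments and reads every column
# of the data frame that the surrounding dplyr verb is working on
dum2fac_all <- function() {
  data <- pick(everything())
  # Each row has exactly one 1, so max.col() has no ties to break
  factor(names(data)[max.col(data)])
}

out <- dummies %>% mutate(name = dum2fac_all())
out$name
```

Since `name` does not exist yet when dum2fac_all() is evaluated, pick(everything()) should see only a, b and c here. With an extra column such as index in the data, you would need a narrower selection, as in the dum2fac(a:c) version above.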