Passing dplyr methods as function arguments


I have a general piece of code A that comes up repeatedly in a series of programs. Each instance of A assumes the form

output_data = input_data %>%
              do common operations %>%
              do specific dplyr methods that vary from instance to instance %>%
              do more common operations
              

Due to the repeated calls to A, it makes sense to wrap this code in a function. In order to handle the instance-specific dplyr method calls, I want to pass the dplyr methods into the function as arguments. As such, I was wondering how you can pass multiple dplyr methods, each with arbitrary numbers of conditions, into a function in a succinct way.

It is not too hard to pass a single dplyr method, together with an arbitrary number of arguments, into a function, e.g.

insert_dplyr_method = function(input_data, dplyr_method, ...) {

    output_data = input_data %>%
                  dplyr_method(...)

    return(output_data)

}

Test

dframe = data.frame(start = c(1,1,1,2,2,3,3,3,3), 
                    middle = sample(1:9), 
                    end = c(1,2,3,1,2,1,2,3,4))

dframe_1 = insert_dplyr_method(dframe, 
                               dplyr::filter, 
                               start == 1, 
                               end == 2)

dframe_2 = insert_dplyr_method(dframe, 
                               dplyr::select, 
                               all_of("start"))

What I would really like to do is pass in n dplyr methods, each with an arbitrary number of arguments, e.g. for the n = 2 case something like

# Wishful pseudocode, not valid R: a function can have only one `...` argument
insert_dplyr_method_2 = function(input_data, dplyr_method_1, ..._1, dplyr_method_2, ..._2) {

    output_data = input_data %>%
                  dplyr_method_1(..._1) %>%
                  dplyr_method_2(..._2)

    return(output_data)

}

The only way I could think of to do this would require passing the dplyr methods and their corresponding arguments into the function as a list, i.e.

dplyr_methods = list(c(dplyr_method_1, ...), c(dplyr_method_2, ...), etc.)

and then calling each method with do.call() (see here, here and here), though I was unable to get it to work.

I was wondering if anyone could show me how to do this? I'm also open to better approaches if anyone knows of one.


Solution

    1) Instead of passing the functions and arguments of the varying portions, pass a pipeline with the arguments already filled in to the main function, do_all. Below we use your example, except that we have added a non-varying sum at the end to show how that works. Note that . %>% whatever is magrittr syntax for defining a function that pipes its input into whatever.

    library(dplyr)
    set.seed(123)
    
    do_all <- function(data, fun) data %>% fun %>% sum  # main function
    
    dframe = data.frame(start = c(1,1,1,2,2,3,3,3,3), 
                        middle = sample(1:9), 
                        end = c(1,2,3,1,2,1,2,3,4))
    
    fun <- . %>% filter(start == 1, end == 2) %>% select(start)
    do_all(dframe, fun)
    ## [1] 1
    
    fun <- . %>% filter(start == 1, end == 2) %>% select(end)
    do_all(dframe, fun)
    ## [1] 2
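
As a small aside on the `. %>% whatever` syntax used above, the following sketch (with the hypothetical name `root_of_sum`) shows that a pipeline starting with `.` is not executed but stored as an ordinary one-argument function, which is why it can be handed to do_all like any other function.

```r
library(magrittr)

# A pipeline that starts with `.` is not run; it is stored as a function.
# `root_of_sum` is a hypothetical name used only for this illustration.
root_of_sum <- . %>% sum %>% sqrt

is.function(root_of_sum)   # TRUE
root_of_sum(c(8, 8))       # 4, the same as sqrt(sum(c(8, 8)))
```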
    

    2) Alternatively, define the pre- and post-processing pipelines once and then write out the entire pipeline each time. pre and post are the non-varying portions.

    pre <- . %>% identity
    post <- . %>% sum
    
    dframe %>% pre %>% filter(start == 1, end == 2) %>% select(start) %>% post
    ## [1] 1
    
    dframe %>% pre %>% filter(start == 1, end == 2) %>% select(end) %>% post
    ## [1] 2
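
If the varying steps really do need to arrive as a list, as the question suggests, one possible sketch is to store each step as its own `. %>% ...` pipeline and apply them in order with base R's Reduce(). The helper name `apply_steps` and the variable `steps` are assumptions made for this illustration, not part of dplyr or of the answer above.

```r
library(dplyr)
set.seed(123)

# Hypothetical helper: apply a list of one-argument functions in order,
# feeding the result of each step into the next.
apply_steps <- function(data, steps) {
  Reduce(function(acc, step) step(acc), steps, data)
}

dframe <- data.frame(start = c(1,1,1,2,2,3,3,3,3),
                     middle = sample(1:9),
                     end = c(1,2,3,1,2,1,2,3,4))

# Each varying step is a pipeline with its arguments already filled in.
steps <- list(
  . %>% filter(start == 1, end == 2),
  . %>% select(start)
)

result <- apply_steps(dframe, steps)                 # one row, one column
total  <- dframe %>% apply_steps(steps) %>% sum      # non-varying sum at the end
total
## [1] 1
```

Because each list element already carries its own arguments, this avoids having to quote expressions such as `start == 1` for do.call(); the trade-off is that the caller writes small pipelines rather than bare function names.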