dplyr syntax for arrow to sum columns specified in a variable


Working in R, I would like Arrow to sum a set of columns specified in a variable.

library(arrow) 
library(dplyr)

example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
cols_to_sum = c('a1','b2','c3')

Arrow is capable of doing this:

example_data %>% mutate(computed_sum = a1+b2+c3)  %>% compute()

#Succeeds

However, I would like to pass the variable rather than name the columns explicitly. The dplyr syntax I'd usually use for this does not work with Arrow:

example_data %>% 
  mutate(computed_sum = rowSums(across(all_of(cols_to_sum))))  %>% 
  compute()

#Error: Expression rowSums(across(all_of(cols_to_sum))) not supported in Arrow
#Call collect() first to pull data into R.

Rebuilding the literal expression string and running it through parse() and eval() does work, but it seems like a cumbersome workaround for what should be a common operation:

temp_expression =  parse( text=paste(cols_to_sum, collapse = '+') )
example_data %>% 
  mutate(computed_sum = eval(temp_expression) )  %>% 
  compute()

#Succeeds

However, the same approach fails without the explicit temporary variable:

example_data %>% 
  mutate(computed_sum = eval( parse( text=paste(cols_to_sum, collapse = '+') ) ) )  %>% 
  compute()

#Error: Expression eval(parse(text = paste(cols_to_sum, collapse = "+"))) not supported in Arrow
#Call collect() first to pull data into R.

What is the correct/best/intended way to use Arrow's R interface to specify recursive computations (e.g., sum) on columns listed in a variable? Do I need to build strings and eval() them to make this happen?

Non-Arrow solutions won't work for me: the data is far too large for memory, stored as Hive-partitioned Parquet files and accessed through Arrow's open_dataset().


Solution

  • I'm not sure why, but if you wrap the summing code in a function (named or anonymous), Arrow will run it. The function can be written recursively or, more simply, with Reduce:

    library(arrow) 
    library(dplyr)
    
    example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
    cols_to_sum = c('a1','b2','c3')
    
    f <- function(...) Reduce(`+`, list(...))
    
    example_data %>%
      mutate(computed_sum = f(!!!syms(cols_to_sum))) %>%
      collect()
    #>   a1 b2 c3 computed_sum
    #> 1  1  4  7           12
    #> 2  2  5  8           15
    #> 3  3  6  9           18
    
    # calling directly errors out
    example_data %>% mutate(computed_sum = Reduce(`+`, syms(cols_to_sum)))
    #> Error: Expression Reduce(`+`, syms(cols_to_sum)) not supported in Arrow
    #> Call collect() first to pull data into R.
    
    # anonymous functions do work
    example_data %>% mutate(computed_sum = (function(...) Reduce(`+`, list(...)))(!!!syms(cols_to_sum)))
    #> InMemoryDataset (query)
    #> a1: double
    #> b2: double
    #> c3: double
    #> computed_sum: double (add_checked(add_checked(a1, b2), c3))
    #> 
    #> See $.data for the source Arrow object
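
A related variant, offered as a sketch rather than a confirmed Arrow idiom: build the whole expression a1 + b2 + c3 before calling mutate(), then inject it with !!. This assumes Arrow's mutate() honors !! the same way it honors !!! in the example above. The name sum_expr is introduced here purely for illustration.

```r
library(arrow)
library(dplyr)

example_data <- InMemoryDataset$create(
  data.frame(a1 = c(1, 2, 3), b2 = c(4, 5, 6), c3 = c(7, 8, 9))
)
cols_to_sum <- c("a1", "b2", "c3")

# Fold the column symbols into a single unevaluated call: a1 + b2 + c3.
# sum_expr is an illustrative name, not part of any package API.
sum_expr <- Reduce(
  function(lhs, rhs) call("+", lhs, rhs),
  rlang::syms(cols_to_sum)
)

# Inject the finished expression, so mutate() receives plain column
# arithmetic rather than a call to Reduce() or eval().
result <- example_data %>%
  mutate(computed_sum = !!sum_expr) %>%
  collect()
```

Because the expression is assembled before mutate() runs, nothing R-specific should be left for Arrow to translate. The same pattern should apply to a dataset opened with open_dataset(), since the work stays in Arrow until collect() or compute() is called.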