Working in R, I would like Arrow to sum a set of columns specified in a variable.
library(arrow)
library(dplyr)
example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
cols_to_sum = c('a1','b2','c3')
Arrow can do this when the columns are written out explicitly:
example_data %>% mutate(computed_sum = a1+b2+c3) %>% compute()
#Succeeds
However, I would like to pass the variable rather than specifying the columns explicitly. The dplyr syntax I'd usually use for this does not work with Arrow:
example_data %>%
  mutate(computed_sum = rowSums(across(all_of(cols_to_sum)))) %>%
  compute()
#Error: Expression rowSums(across(all_of(cols_to_sum))) not supported in Arrow
#Call collect() first to pull data into R.
Reconstructing the literal input string with parse() and eval() does work, but it seems like a cumbersome workaround for what should be a common operation:
temp_expression = parse(text = paste(cols_to_sum, collapse = '+'))
example_data %>%
  mutate(computed_sum = eval(temp_expression)) %>%
  compute()
#Succeeds
However, the same approach fails when the expression is inlined instead of being stored in a temporary variable:
example_data %>%
  mutate(computed_sum = eval(parse(text = paste(cols_to_sum, collapse = '+')))) %>%
  compute()
#Error: Expression eval(parse(text = paste(cols_to_sum, collapse = "+"))) not supported in Arrow
#Call collect() first to pull data into R.
What is the correct/best/intended way to use Arrow's R interface to specify recursive computations (e.g., sum) on columns listed in a variable? Do I need to build strings and eval() them to make this happen?
Non-Arrow solutions won't work for me. I am working with data far too large for memory, distributed as hive-partitioned parquets and accessed by Arrow's open_dataset().
I'm not sure why, but if you wrap the recursive code in a function (named or anonymous), Arrow will let you run it. It is most simply written with Reduce:
library(arrow)
library(dplyr)
example_data = InMemoryDataset$create(data.frame(a1 = c(1,2,3), b2=c(4,5,6), c3=c(7,8,9)))
cols_to_sum = c('a1','b2','c3')
f <- function(...) Reduce(`+`, list(...))
example_data %>%
  mutate(computed_sum = f(!!!syms(cols_to_sum))) %>%
  collect()
#>   a1 b2 c3 computed_sum
#> 1  1  4  7           12
#> 2  2  5  8           15
#> 3  3  6  9           18
# calling directly errors out
example_data %>% mutate(computed_sum = Reduce(`+`, syms(cols_to_sum)))
#> Error: Expression Reduce(`+`, syms(cols_to_sum)) not supported in Arrow
#> Call collect() first to pull data into R.
# anonymous functions do work
example_data %>% mutate(computed_sum = (function(...) Reduce(`+`, list(...)))(!!!syms(cols_to_sum)))
#> InMemoryDataset (query)
#> a1: double
#> b2: double
#> c3: double
#> computed_sum: double (add_checked(add_checked(a1, b2), c3))
#>
#> See $.data for the source Arrow object
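Another option, offered as a sketch and not as Arrow's documented approach: build the sum as an unevaluated call in base R and inject it into mutate() with !!, so that Arrow only ever sees the expanded expression a1 + b2 + c3. The name sum_expr is purely illustrative, and this assumes !! injection is handled the same way as the !!! splice in the example above.

```r
library(arrow)
library(dplyr)

example_data <- InMemoryDataset$create(
  data.frame(a1 = c(1, 2, 3), b2 = c(4, 5, 6), c3 = c(7, 8, 9))
)
cols_to_sum <- c("a1", "b2", "c3")

# Fold the column names into one nested call: (a1 + b2) + c3.
# as.name() turns each string into a symbol, so no string parsing is needed.
sum_expr <- Reduce(function(x, y) call("+", x, y), lapply(cols_to_sum, as.name))

# !! injects the finished call before Arrow translates the expression.
result <- example_data %>%
  mutate(computed_sum = !!sum_expr) %>%
  collect()
```

This avoids parse() and eval() entirely, and because the expression is built from symbols and not from pasted text, it should also cope with column names that are not syntactically valid R identifiers.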