Tags: r, dplyr, apache-arrow, across

How to write anonymous functions in R arrow across


I have opened a .parquet dataset through the open_dataset function of the arrow package. I want to use across to clean several numeric columns at a time. However, when I run this code:

start_numeric_cols = "sum"
sales <- sales %>% mutate(
  across(starts_with(start_numeric_cols) & (!where(is.numeric)), 
         \(col) {replace(col, col == "NULL", 0) %>% as.numeric()}),
  across(starts_with(start_numeric_cols) & (where(is.numeric)),
         \(col) {replace(col, is.na(col), 0)})
)
#> Error in `across_setup()`:
#> ! Anonymous functions are not yet supported in Arrow

The error message is pretty informative, but I am wondering whether there is any way to do the same only with dplyr verbs within across (or another workaround without having to type each column name).


Solution

  • arrow supports a growing set of functions that can be used without pulling the data into R (they are listed in the arrow package documentation), but replace() is not yet one of them. However, you can use ifelse()/if_else()/case_when() instead. Note also that purrr-style lambda functions (~ .x) are supported where regular anonymous functions (\(x) or function(x)) are not.

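    I don't have your data, so I will use the iris dataset as an example to demonstrate that the query builds successfully, even if it doesn't make complete sense in the context of this data. Every iris column starting with "P" is already numeric, so only the second across() matches any columns, which is why only that expression shows up in the printed query.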

    library(arrow)
    library(dplyr)
    
    start_numeric_cols <- "P"
    
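    iris %>%
      as_arrow_table() %>%
      mutate(
        # Non-numeric columns: swap the "NULL" string for "0", then cast.
        # "0" is quoted so both if_else() branches are strings.
        across(
          starts_with(start_numeric_cols) & (!where(is.numeric)),
          ~ as.numeric(if_else(.x == "NULL", "0", .x))
        ),
        # Numeric columns: replace missing values with 0.
        across(
          starts_with(start_numeric_cols) & (where(is.numeric)),
          ~ if_else(is.na(.x), 0, .x)
        )
      )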
    
    Table (query)
    Sepal.Length: double
    Sepal.Width: double
    Petal.Length: double (if_else(is_null(Petal.Length, {nan_is_null=true}), 0, Petal.Length))
    Petal.Width: double (if_else(is_null(Petal.Width, {nan_is_null=true}), 0, Petal.Width))
    Species: dictionary<values=string, indices=int8>
    
    See $.data for the source Arrow object
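
To check the same pattern on data that actually contains "NULL" strings, here is a minimal sketch. The sales_demo table and its sum_a/sum_b columns are made up for illustration (they are not the asker's data), and it assumes an arrow version that supports across() and where() in queries, as the iris example does.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the asker's data: one text column that uses the
# string "NULL" for missing values, and one numeric column with a real NA.
sales_demo <- data.frame(
  id = 1:3,
  sum_a = c("1.5", "NULL", "3"),
  sum_b = c(2, NA, 4),
  stringsAsFactors = FALSE
)

start_numeric_cols <- "sum"

result <- sales_demo %>%
  as_arrow_table() %>%
  mutate(
    # Text columns: replace "NULL" with "0" (a string), then cast to double.
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric columns: replace missing values with 0.
    across(
      starts_with(start_numeric_cols) & (where(is.numeric)),
      ~ if_else(is.na(.x), 0, .x)
    )
  ) %>%
  collect()  # run the query and pull the result back into R
```

After collect(), both sum_a and sum_b should come back as double columns with 0 where the "NULL" string or the NA used to be. With a dataset opened through open_dataset(), the same mutate() call should apply unchanged in place of the as_arrow_table() step.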