I have opened a .parquet dataset through the open_dataset function of the arrow package. I want to use across to clean several numeric columns at a time. However, when I run this code:
start_numeric_cols = "sum"
sales <- sales %>% mutate(
  across(starts_with(start_numeric_cols) & (!where(is.numeric)),
         \(col) {replace(col, col == "NULL", 0) %>% as.numeric()}),
  across(starts_with(start_numeric_cols) & (where(is.numeric)),
         \(col) {replace(col, is.na(col), 0)})
)
#> Error in `across_setup()`:
#> ! Anonymous functions are not yet supported in Arrow
The error message is pretty informative, but I am wondering whether there is any way to do the same using only dplyr verbs within across (or another workaround that avoids typing each column name).
arrow has a growing set of functions that can be used without pulling the data into R (listed in the package documentation), but replace() is not yet supported. However, you can use ifelse(), if_else() or case_when() instead. Note also that purrr-style lambda functions (~ .x) are supported where regular anonymous functions (\(x) or function(x)) are not.
I don't have your data, so I will use the iris dataset as an example to demonstrate that the query builds successfully, even if it doesn't make complete sense in the context of this data.
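library(arrow)
library(dplyr)

start_numeric_cols <- "P"

iris %>%
  as_arrow_table() %>%
  mutate(
    # Non-numeric columns: swap the "NULL" marker for "0" (kept as a string so
    # both branches of if_else() have the same type), then convert to numeric
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric columns: replace missing values with 0
    across(
      starts_with(start_numeric_cols) & (where(is.numeric)),
      ~ if_else(is.na(.x), 0, .x)
    )
  )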
Table (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double (if_else(is_null(Petal.Length, {nan_is_null=true}), 0, Petal.Length))
Petal.Width: double (if_else(is_null(Petal.Width, {nan_is_null=true}), 0, Petal.Width))
Species: dictionary<values=string, indices=int8>
See $.data for the source Arrow object
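
Because iris has no character columns starting with "P", the first across() never fires in the example above. As a rough check of both branches, here is a minimal sketch on a made-up table. The name toy and its columns sum_a, sum_b and other are hypothetical and not taken from your data, and it assumes an arrow version in which across() and where() work on Arrow tables as they do above.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data: one character column using "NULL" as a missing
# marker, one numeric column containing NA, and one unrelated column
toy <- data.frame(
  sum_a = c("1", "NULL", "3"),
  sum_b = c(1, NA, 3),
  other = c("x", "y", "z")
)

start_numeric_cols <- "sum"

cleaned <- toy %>%
  as_arrow_table() %>%
  mutate(
    # Character columns: "NULL" becomes "0", then the column is cast to numeric
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric columns: NA becomes 0
    across(
      starts_with(start_numeric_cols) & (where(is.numeric)),
      ~ if_else(is.na(.x), 0, .x)
    )
  ) %>%
  collect()  # run the query and pull the result into R
```

With a dataset opened through open_dataset(), the same mutate() call should apply unchanged. If the result is too large to collect(), writing it back out with write_dataset() instead is one option to consider.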