Search code examples
rpurrrapache-arrow

Using Apache Arrow or DuckDB inside a map function in R


I am analysing a large dataset saved as a hive-partitioned set of parquet files using the apache arrow package. The whole data is too large to load into memory, so I would like to process it in chunks using a map function (and eventually a future_map so that I can run it in parallel. Unfortunately I get an error message whenever I try to use the arrow dataset functions inside a map call.

Below is a reprex using the diamonds dataset.

library(arrow)
library(tidyverse)
library(duckdb)

# Load the colours dataset
data(diamonds)
colors = diamonds$color %>% unique()

# Write the data to a hive dataset
diamonds %>%
  write_dataset(path = 'diamonds', partitioning = c('color', 'clarity'))
rm(diamonds)

# Load it back in with Arrow and convert to DuckDB
diamonds = open_dataset('diamonds') %>%
  to_duckdb()

# Check it works (prints a table filtered to E)
diamonds %>% filter(color == "E") 

# Get separate table from each colour
colors %>%
  map(function(x) {

    diamonds %>%
      filter(color == x) %>%
      collect()
    
  })

This gives the following error message:

Error in `map()`:
ℹ In index: 1.
Caused by error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_execute: Failed to run query
Error: Conversion Error: Could not convert string 'D' to DOUBLE
LINE 3: WHERE (color = x)

If I run without converting to a duckDB then I get a similar error:

Error in `map()`:
ℹ In index: 1.
Caused by error in `compute.arrow_dplyr_query()`:
! NotImplemented: Function 'equal' has no kernel matching input types (string, double)

Any ideas how I can overcome this issue?


Solution

  • x is a name of a column in diamonds. Use .env$x or !!x.