Tags: r, parquet, apache-arrow

Read partitioned parquet directory (all files) in one R dataframe with apache arrow


How do I read a partitioned parquet file into R with arrow (without any Spark)?

The situation

  1. Parquet files are created by a Spark pipeline and saved on S3
  2. They are read with RStudio/RShiny, with one column as an index, for further analysis

The parquet file structure

The parquet file created by my Spark job consists of several parts:

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet")

but this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

It works if I read just one file from the directory:

install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

but I need to load all of the files in order to query across them.

What I found in the documentation

In the apache arrow documentation (https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html) I found that there are some properties for the read_parquet() command, but I can't get them working and do not find any examples.

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
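
For example, col_select works for me on a single part file (the column names below are placeholders for my schema), but it still reads only that one file:

library(arrow)
one_part <- read_parquet(
  "component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet",
  col_select = c("component_id", "component_name")  # placeholder column names
)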

How do I set the properties correctly to read the full directory?

# it should be one of these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)

Help would be much appreciated.


Solution

  • As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0) this is now possible.

    I noticed your files use snappy compression, which requires a special build flag before installation (installation documentation: https://arrow.apache.org/docs/r/articles/install.html):

    Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
    install.packages("arrow", force = TRUE)
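
    You can verify that the resulting build actually includes snappy support with arrow_info(), which reports the package's compiled-in capabilities:

    library(arrow)
    ## the capabilities section of the output should list snappy as TRUE
    arrow_info()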
    

    The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html

    The code below is a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns; I'm still working out that syntax myself, but a sketch of the dplyr approach follows below.

    library(arrow)
    
    ## Define the dataset
    DS <- arrow::open_dataset(sources = "/path/to/directory")
    ## Create a scanner
    SO <- Scanner$create(DS)
    ## Load it as an Arrow Table in memory
    AT <- SO$ToTable()
    ## Convert it to an R data frame
    DF <- as.data.frame(AT)
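
    For the filtering and column selection mentioned above, one approach is arrow's dplyr bindings. This is a minimal sketch, not a tested recipe; the column names are hypothetical:

    library(arrow)
    library(dplyr)

    ## filter() and select() are evaluated lazily by Arrow;
    ## collect() materializes the result as an R data frame
    DF <- open_dataset(sources = "/path/to/directory") %>%
      filter(component_id > 100) %>%             # hypothetical column
      select(component_id, component_name) %>%   # hypothetical columns
      collect()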