Tags: r, parquet, apache-arrow

Read partitioned parquet directory (all files) in one R dataframe with apache arrow


How do I read a partitioned parquet file into R with arrow (without any Spark)?

The situation

  1. Parquet files are created by a Spark pipeline and saved on S3
  2. They are read with RStudio/RShiny, with one column as an index, for further analysis

The parquet file structure

The parquet file created by my Spark job consists of several parts:

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet")

but this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

It works if I read just one file from the directory:

install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

but I need to load all of the files in order to query across them.

What I found in the documentation

In the apache arrow documentation (https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html) I found that there are some properties for the read_parquet() command, but I can't get them working and do not find any examples.

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
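
For example, col_select works for me on a single part file (the column names below are placeholders for my schema), but it still reads only that one file:

library(arrow)
one_part <- read_parquet(
  "component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet",
  col_select = c("component_id", "component_name")  # placeholder column names
)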

How do I set the properties correctly to read the full directory?

# it should be one of these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)

Help would be much appreciated.


Solution

  • As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0) this is now possible.

    I noticed your files use snappy compression, which requires a special build flag before installation (installation documentation: https://arrow.apache.org/docs/r/articles/install.html):

    Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
    install.packages("arrow", force = TRUE)
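
    You can verify that the resulting build actually includes snappy support with arrow_info(), which reports the package's compiled-in capabilities:

    library(arrow)
    ## the capabilities section of the output should list snappy as TRUE
    arrow_info()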
    

    The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html

    The code below is a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns; I'm still working out that syntax myself, but a sketch of the dplyr approach follows below.

    library(arrow)
    
    ## Define the dataset
    DS <- arrow::open_dataset(sources = "/path/to/directory")
    ## Create a scanner
    SO <- Scanner$create(DS)
    ## Load it as an Arrow Table in memory
    AT <- SO$ToTable()
    ## Convert it to an R data frame
    DF <- as.data.frame(AT)
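
    For the filtering and column selection mentioned above, one approach is arrow's dplyr bindings. This is a minimal sketch, not a tested recipe; the column names are hypothetical:

    library(arrow)
    library(dplyr)

    ## filter() and select() are evaluated lazily by Arrow;
    ## collect() materializes the result as an R data frame
    DF <- open_dataset(sources = "/path/to/directory") %>%
      filter(component_id > 100) %>%             # hypothetical column
      select(component_id, component_name) %>%   # hypothetical columns
      collect()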