r apache-arrow

Is there a way to combine two datasets while keeping them in Arrow format in R?


Is there a way to combine two datasets read with the arrow package while keeping them in Arrow format? The approach below works, but converting each dataset to a data frame and then applying rbindlist takes too much time, hence my question.

    install.packages("arrow")
    library("arrow")
    library("data.table")  # provides rbindlist()

    a <- arrow::open_dataset("a.parquet")
    b <- arrow::open_dataset("b.parquet")
    # Converting to data frames takes too long when a and b are very big.
    a1 <- as.data.frame(a)
    b1 <- as.data.frame(b)
    merged <- rbindlist(list(a1, b1))

I am looking for a quick way to combine the two datasets, ideally while keeping them in Arrow format, but other fast approaches are welcome too.


Solution

  • The arrow package supports reading multiple Parquet files at once, which may achieve what you are after (see the note below about partitioning from @r2evans). Assuming the datasets have an identical schema, you can pass multiple files to a single call to open_dataset, and they will then be treated as if they were a single dataset, e.g.

    library(arrow)
    library(dplyr)
    
    file1 <- tempfile()
    file2 <- tempfile()
    write_parquet(iris, file1)
    write_parquet(iris, file2)
    
    files <- c(file1, file2)
    x <- open_dataset(files)
    x |>
        select(Sepal.Length) |>
        nrow()
    

    Full details can be found in the vignette
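
If the two datasets have already been opened separately, as `a` and `b` are in the question, a similar result should be possible without going back to the file paths. The sketch below is not from the original answer: it assumes both inputs share a schema, uses two temporary copies of `iris` as hypothetical stand-ins for `a.parquet` and `b.parquet`, and relies on `open_dataset()` accepting a list of Dataset objects. It also shows `concat_tables()`, which I believe is available in recent arrow releases, for the case where an in-memory Arrow Table is wanted instead of a lazy Dataset.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-ins for a.parquet and b.parquet
file_a <- tempfile(fileext = ".parquet")
file_b <- tempfile(fileext = ".parquet")
write_parquet(iris, file_a)
write_parquet(iris, file_b)

a <- open_dataset(file_a)
b <- open_dataset(file_b)

# Combine the two already-opened Datasets into one lazy Dataset.
# No rows are pulled into an R data frame at this point.
combined <- open_dataset(list(a, b))

# Queries run against the combined Dataset as if it were a single source
n_combined <- combined |>
    select(Sepal.Length) |>
    nrow()

# Alternative: read each file as an Arrow Table (not a data.frame)
# and concatenate the Tables in memory.
tbl_a <- read_parquet(file_a, as_data_frame = FALSE)
tbl_b <- read_parquet(file_b, as_data_frame = FALSE)
tbl <- concat_tables(tbl_a, tbl_b)
```

The Dataset route stays lazy, so it is usually the better fit when `a` and `b` are too large to hold in memory comfortably. The Table route loads everything into Arrow memory, which still avoids the conversion to data frames that made the `rbindlist` approach slow.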