Tags: r, parquet, apache-arrow

write multiple parquet files by chunk_size


I see the chunk_size argument in arrow::write_parquet(), but it doesn't behave as I expected. I would expect the code below to generate 3 separate parquet files, but only one is created, even though nrow(df) > chunk_size.

library(arrow)
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))

# dataframe with 3e6 rows
n  <- 3e6 
df <- data.frame(x = rnorm(n))

# write with chunk_size 1e6, and view directory
write_parquet(df, tf, chunk_size = 1e6)
list.files(td)

Returns one file instead of 3:

[1] "25ff74854ba6.parquet"  
# read parquet and show all rows are there
nrow(read_parquet(tf))

Returns:

[1] 3000000

We can't pass multiple file name arguments to write_parquet(), and I don't want to partition, so write_dataset() also seems inapplicable.


Solution

  • The chunk_size parameter refers to how many rows of data are written to disk at once - it controls the size of the row groups within the single output file, rather than the number of files produced. The write_parquet() function is designed to write individual files, whereas, as you said, write_dataset() allows partitioned file writing. I don't believe that splitting files on any other basis is supported at the moment, though it is a possibility in future releases. If you have a specific reason for wanting 3 separate files, I'd recommend separating the data into multiple datasets first and then writing each of those via write_parquet() - see the sketches below.

    (Also, I am one of the devs on the R package, and can see that this isn't entirely clear from the docs, so I'm going to open a ticket to update those - thanks for flagging this up)
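
    To see what chunk_size actually did here, you can count the row groups in the single file you got. A minimal sketch, assuming the reprex above has already been run (ParquetFileReader is the arrow R package's low-level parquet reader):

    library(arrow)

    # open the file written above and count its row groups;
    # with 3e6 rows and chunk_size = 1e6, this should report
    # 3 row groups inside the one file
    reader <- ParquetFileReader$create(tf)
    reader$num_row_groups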
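
    And if you do need 3 separate files, here is a sketch of the split-then-write approach suggested above (write_parquet_chunks is a hypothetical helper, not part of arrow):

    library(arrow)

    # hypothetical helper: split a data frame into pieces of at most
    # `chunk_size` rows and write each piece to its own parquet file
    write_parquet_chunks <- function(df, dir, chunk_size = 1e6, prefix = "part-") {
      groups <- ceiling(seq_len(nrow(df)) / chunk_size)
      pieces <- split(df, groups)
      paths  <- file.path(dir, sprintf("%s%03d.parquet", prefix, seq_along(pieces)))
      Map(write_parquet, pieces, paths)
      paths
    }

    # write into a fresh directory so only the new files are listed
    out_dir <- file.path(td, "chunks")
    dir.create(out_dir)
    write_parquet_chunks(df, out_dir, chunk_size = 1e6)
    list.files(out_dir)
    # "part-001.parquet" "part-002.parquet" "part-003.parquet"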