I see the chunk_size argument in arrow::write_parquet(), but it doesn't seem to behave as expected. I would expect the code below to generate 3 separate parquet files, but only one is created, and nrow > chunk_size.
library(arrow)
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write with chunk_size 1e6, and view directory
write_parquet(df, tf, chunk_size = 1e6)
list.files(td)
Returns one file instead of 3:
[1] "25ff74854ba6.parquet"
# read parquet and show all rows are there
nrow(read_parquet(tf))
Returns:
[1] 3000000
We can't pass multiple file name arguments to write_parquet(), and I don't want to partition, so write_dataset() also seems inapplicable.
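For reference, the closest write_dataset() seems to get is partitioning on a synthetic grouping column, which is exactly what I want to avoid. A rough sketch, with the part column made up purely for illustration:
# add a throwaway grouping column and partition on it
df$part <- ceiling(seq_len(n) / 1e6)
write_dataset(df, td, partitioning = "part")
# writes hive-style subdirectories, e.g. part=1/, part=2/, part=3/
list.files(td, recursive = TRUE)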
The chunk_size parameter refers to how many rows of data are written to disk at once, and so to the size of the row groups within the file, rather than to the number of files produced.
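You can verify this on the file from the question: all 3e6 rows land in a single file, but split across row groups. A quick sketch using arrow's low-level ParquetFileReader, assuming tf still points at the file written above:
# inspect the row groups of the single output file
reader <- ParquetFileReader$create(tf)
reader$num_row_groups
# expect 3: one row group per 1e6-row chunk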
The write_parquet() function is designed to write individual files, whereas, as you said, write_dataset() allows partitioned file writing. I don't believe that splitting files on any other basis is supported at the moment, though it may be added in future releases. If you have a specific reason for wanting 3 separate files, I'd recommend splitting the data into multiple data frames first and then writing each of those via write_parquet(), as sketched below.
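For the 3-file case, a minimal sketch of that approach, splitting the original df with base split() (the chunk-%d.parquet names are just illustrative):
# break the data frame into 1e6-row pieces
chunks <- split(df, ceiling(seq_len(nrow(df)) / 1e6))
# write each piece to its own parquet file
paths <- file.path(td, sprintf("chunk-%d.parquet", seq_along(chunks)))
Map(write_parquet, chunks, paths)
list.files(td, pattern = "\\.parquet$")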
(Also, I am one of the devs on the R package, and can see that this isn't entirely clear from the docs, so I'm going to open a ticket to update those - thanks for flagging this up)