I'm wondering what the correct approach is to creating an Apache Arrow multi-file dataset as described here in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
data_ids <- c(123, 234, 345, 456, 567)
# write data in batches
for (id in data_ids) {
## assume this is some complicated computation that returns 1,000,000 records
df <- data_load_helper(id)
df <- group_by(df, col_1, col_2, col_3)
arrow::write_dataset(df, "arrow_dataset/", format = 'arrow')
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format="arrow", partitioning=c("col_1", "col_2", "col_3"))
# check some data
dat %>%
filter(col_1 == 123) %>%
collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.
TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas then you will need to make sure to pass in unify_schemas=TRUE
when you are opening the dataset for reading. This could also become costly and you may want to just save the unified schema off on its own.
certain chunks of the data will update at different periods over time.
If by "update" you mean "add more data" then you may need to supply a basename_template
. Otherwise every call to write_dataset
will try and create part-0.arrow
and they will overwrite each other. A common practice to work around this is to include some kind of UUID in the basename_template
.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching"
. If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 100,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3 then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.