I have a dataset called df with year, month and day variables. I would like to use the write_dataset function to output a folder in the standard Arrow dataset layout, partitioned by those columns.
Within each year folder there will be month=1, month=2, and so on.
Now, in order to create this I have used the following code:
library(dplyr)

df <- df %>% group_by(year, month, day)
output_folder <- "my/path"
# With a grouped data frame, write_dataset partitions by the grouping columns.
arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")
However, my dataset is too big, and I would like to use data.table to take advantage of its fast grouping. My attempt to do the same was the following:
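library(data.table)

grouping_cols <- c("year", "month", "day")
setDT(df)  # setkeyv needs a data.table; this converts df in place if it is not one already
setkeyv(df, grouping_cols)
# No partitioning argument is given here, so the key is not used for partitioning.
arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")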
However, the result is no longer partitioned: a single .parquet file is written, which does not take full advantage of arrow::write_dataset.
Is there any way to have the same dataset partitioned by the specified columns, but based on data.table instead of dplyr groupings?
If you look at the docs, the default value of the partitioning parameter is whatever the dataset's dplyr::group_vars are. A data.table key is not automatically translated into that concept, so you have to supply the parameter yourself if you're not using a grouped dplyr object as the input.
arrow::write_dataset(df,
                     output_folder,
                     partitioning = grouping_cols,
                     format = "parquet")
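To check the behaviour end to end, here is a minimal sketch on made-up data. The table dt, its values, and the temporary output folder are assumptions for illustration only and are not part of the original question. It shows that a keyed data.table reports no dplyr grouping variables, and that passing partitioning explicitly should produce the nested year=/month=/day= folders.

```r
library(data.table)

# Toy data standing in for the real df (values are invented for illustration).
dt <- data.table(
  year  = c(2020L, 2020L, 2021L, 2021L),
  month = c(1L, 2L, 1L, 2L),
  day   = c(5L, 6L, 7L, 8L),
  value = c(0.1, 0.2, 0.3, 0.4)
)

grouping_cols <- c("year", "month", "day")
setkeyv(dt, grouping_cols)

# A data.table key is not a dplyr grouping, so this should come back empty,
# which is why the default partitioning writes a single file.
key_as_groups <- dplyr::group_vars(dt)
print(key_as_groups)

# Write to a temporary folder, naming the partition columns explicitly.
output_folder <- file.path(tempdir(), "partitioned_example")
arrow::write_dataset(dt,
                     output_folder,
                     format = "parquet",
                     partitioning = grouping_cols)

# List the files that were written, relative to output_folder.
written <- list.files(output_folder, recursive = TRUE)
print(written)
```

Each of the four rows has a distinct year/month/day combination, so one file per combination is expected, each under its own year=/month=/day= path. The result can be read back with arrow::open_dataset(output_folder).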