r, dplyr, data.table, apache-arrow

How to write an arrow dataset based on a data.table grouping?


I have a dataset called df with year, month and day variables. I would like to use the write_dataset function to output a folder in the standard arrow dataset layout, with one folder per value of year at the top level.

Within each folder there will be month=1, month=2, and so on.

Now, in order to create this I have used the following code:

library(dplyr)

df <- df %>% group_by(year, month, day)
output_folder <- "my/path"
arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")

However, my dataset is too large, and I would like to use data.table to take advantage of its fast grouping. My attempt at doing the same thing was the following:

library(data.table)

grouping_cols <- c("year", "month", "day")
setkeyv(df, grouping_cols)  # key (and sort) df by the grouping columns

arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")

However, now the result is not grouped and a single .parquet file is returned (not fully utilizing the potential of arrow::write_dataset).


Is there any way to have the same dataset grouped by specified columns but based on data.table instead of dplyr groupings?


Solution

  • According to the docs, the partitioning argument of write_dataset defaults to dplyr::group_vars(dataset), that is, the dataset's dplyr grouping columns. A data.table key is not a dplyr grouping and is not translated into one automatically, so setkeyv() has no effect on how the output is partitioned. If the input is not a grouped dplyr object, you have to supply partitioning yourself.

    arrow::write_dataset(df,
                         output_folder,
                         partitioning = grouping_cols,
                         format = "parquet")
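
To check the result, here is a minimal self-contained sketch of the same call on a toy data.table. The sample values and the temporary output path are made up for illustration and are not from the question; only write_dataset and its partitioning argument come from the answer above.

```r
library(data.table)
library(arrow)

# Toy data standing in for the asker's df (values are invented)
df <- data.table(
  year  = c(2020L, 2020L, 2021L, 2021L),
  month = c(1L, 2L, 1L, 1L),
  day   = c(1L, 15L, 3L, 4L),
  value = c(0.1, 0.2, 0.3, 0.4)
)

grouping_cols <- c("year", "month", "day")

# Hypothetical output location inside the session's temp directory
output_folder <- file.path(tempdir(), "partitioned_dataset")

# Pass the partition columns explicitly, since df carries no dplyr grouping
arrow::write_dataset(df,
                     output_folder,
                     partitioning = grouping_cols,
                     format = "parquet")

# List the written files; each relative path should start with
# year=<value>/month=<value>/day=<value>/
written <- list.files(output_folder, recursive = TRUE)
print(written)
```

With explicit partitioning there should be one file per distinct year/month/day combination rather than a single .parquet file. Whether df was keyed with setkeyv beforehand should not change the folder layout, since the layout comes from the partitioning argument.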