I would like to merge small Parquet files into 1 or 2 bigger files. Is it possible to set a max file size? My goal is to get files between 200MB and 1GB to optimize Athena requests. Is it possible to do this with PyArrow?
Currently (with version 2) it is not possible to set a maximum file size. One thing you can do is write the file in chunks using the pyarrow.parquet.ParquetWriter
class. After writing each chunk, check the size of the content written so far. Some additional data (the footer) is appended when you close the writer, but that is typically less than 64 KiB. Be careful not to pick row groups that are too small, though, as that would defeat Parquet's compression and encoding performance. I would suggest picking a chunk size (number of rows) that typically yields about 50 MiB of data in your case.