I have a large Parquet file with a number of small row groups. I'd like to produce a new Parquet file with a single (bigger) row group, and I'm operating in Python. I could do something like:
import pyarrow.parquet as pq
table = pq.read_table('many_tiny_row_groups.parquet')
pq.write_table(table, 'one_big_row_group.parquet')
# Lots of row groups...
print(pq.ParquetFile('many_tiny_row_groups.parquet').num_row_groups)
# Now, only 1 row group...
print(pq.ParquetFile('one_big_row_group.parquet').num_row_groups)
However, this requires that I read the entire Parquet file into memory at once. I would like to avoid doing that. Is there some sort of "streaming" approach in which the memory footprint can stay small?
pyarrow's write_dataset accepts a Scanner (or an iterable of record batches), so you can avoid loading everything into a single in-memory table at once:
import pyarrow.dataset as ds

# Treat the folder of small-row-group files as one logical dataset
input_dataset = ds.dataset("folder_of_tiny_rowgroup_parquets")
# The scanner yields record batches lazily instead of materializing a full table
scanner = input_dataset.scanner()
# Row groups are only flushed to disk once at least min_rows_per_group rows accumulate
ds.write_dataset(scanner, "folder_of_large_rowgroup_parquets", format="parquet", min_rows_per_group=20000)
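To confirm the rewrite, you can reuse the num_row_groups check from your question on the output files. A minimal sketch, assuming the output folder above and write_dataset's default part-{i}.parquet file naming:

import pathlib
import pyarrow.parquet as pq

# Each output file should now contain a small number of large row groups
for path in pathlib.Path("folder_of_large_rowgroup_parquets").glob("*.parquet"):
    print(path.name, pq.ParquetFile(path).num_row_groups)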
This is not quite what you asked for: instead of rewriting file-by-file, the output will likely contain fewer files than the input, because compaction can also happen across files. It also limits your control over the names given to the individual Parquet files that get written, although the basename_template argument gives you some of that control back (see the sketch below).
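Here is a minimal sketch of that. The folder names, the "compacted-{i}.parquet" template, and the 1,000,000-row target are placeholders for your data, but min_rows_per_group, max_rows_per_group, max_rows_per_file, and basename_template are all documented write_dataset parameters in recent pyarrow versions. Setting the three row limits to the same value should give roughly one row group per output file, and basename_template controls the file names ("{i}" is replaced with an incrementing counter):

import pyarrow.dataset as ds

input_dataset = ds.dataset("folder_of_tiny_rowgroup_parquets")
ds.write_dataset(
    input_dataset.scanner(),
    "folder_of_large_rowgroup_parquets",
    format="parquet",
    basename_template="compacted-{i}.parquet",  # "{i}" becomes 0, 1, 2, ...
    min_rows_per_group=1_000_000,   # buffer batches until this many rows accumulate
    max_rows_per_group=1_000_000,   # then flush exactly one row group
    max_rows_per_file=1_000_000,    # and start a new file after it
)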
From: https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data