python, memory, compression, parquet

Collapsing row-groups in Parquet efficiently


I have a large Parquet file with a number of small row groups. I'd like to produce a new Parquet file with a single (bigger) row group, and I'm operating in Python. I could do something like:

import pyarrow.parquet as pq
table = pq.read_table('many_tiny_row_groups.parquet')
pq.write_table(table, 'one_big_row_group.parquet')

# Lots of row groups...
print(pq.ParquetFile('many_tiny_row_groups.parquet').num_row_groups)
# Now, only 1 row group...
print(pq.ParquetFile('one_big_row_group.parquet').num_row_groups)

However, this requires that I read the entire Parquet file into memory at once. I would like to avoid doing that. Is there some sort of "streaming" approach in which the memory footprint can stay small?


Solution

  • pyarrow's write_dataset accepts an iterable of record batches, so you can avoid loading everything into a single in-memory table at once.

    import pyarrow.dataset as ds

    # Stream the input as a dataset instead of loading it into one big table.
    input_dataset = ds.dataset("folder_of_tiny_rowgroup_parquets")
    scanner = input_dataset.scanner()

    # Rows are buffered until min_rows_per_group is reached before a row group
    # is flushed, so the output files get larger row groups.
    ds.write_dataset(scanner, "folder_of_large_rowgroup_parquets",
                     format="parquet", min_rows_per_group=20000)
    

    This is not quite what you asked for: instead of rewriting file-by-file, the output will likely have fewer files than your input, i.e. it may also compact across files. It also limits your control over the names of the individual Parquet files that get written, although the basename_template argument (see the sketch below) gives you some of that control back.

    From: https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data
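
  • The same approach works when the input is a single file, as in the question: ds.dataset also accepts a path to one Parquet file, and basename_template gives some control over the output file name. A minimal sketch, assuming hypothetical paths and illustrative threshold values (min_rows_per_group, max_rows_per_group, and basename_template are real write_dataset parameters; the specific numbers and names are not prescribed anywhere):

    import pyarrow.dataset as ds

    # Treat the single input file as a streaming dataset rather than reading it whole.
    single_file = ds.dataset("many_tiny_row_groups.parquet", format="parquet")

    ds.write_dataset(
        single_file.scanner(),
        "compacted",                              # hypothetical output directory
        format="parquet",
        min_rows_per_group=20000,                 # buffer at least this many rows per row group
        max_rows_per_group=1_000_000,             # cap on rows per row group (illustrative)
        basename_template="one_big_row_group-{i}.parquet",  # the "{i}" placeholder is required
    )

    If the input holds more rows than max_rows_per_group, the output will still contain more than one (albeit much larger) row group, so raise that cap if you really need a single one.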