Search code examples
parquetpyarrow

Partitioning PyArrow Parquet file and writing it out sorted to a dataset


I have a PyArrow Parquet file that is too large to process in memory. Because data can be easily partitioned into different shards, I'd like to manually partition this and create a PyArrow dataset out of the file. While I am partitioning, the rows within the partition themselves need to be re-sorted, so that iterating the data can be done in a natural order.

Here is a schema snippet if it is relevant. The partitioning key is chain_id:

pa.schema([
    ("chain_id", pa.uint32()),
    ("pair_id", pa.uint64()),
    ("block_number", pa.uint32()),
    ("timestamp", pa.timestamp("s")),
    ("tx_hash", pa.binary(32)),
    ("log_index", pa.uint32()),
])

The process I was planning to use it

  • Determine partition ids beforehand (chain_id in the above schema)
  • Open a new dataset for writing
  • For each partition id
    • Create an in-memory temporary PyArrow table
    • Read (iterate batches) the source Parquet file
      • Identify rows belonging to this partition and add them to the in-memory table
    • Sort the rows in memory
    • Append table to the dataset

However the documentation on FileSystemDataset on PyArrow is skinny. My question along the lines are

  • How do I add full tables to a FilesystemDataset, assuming the table is the whole content of a partition itself?
  • Is there any existing tools to partition (PyArrow) Parquet files to datasets, without needing a write a manual script?
  • What kind of ordering guarantees pyarrow.dataset.FileSystemDataset gives assuming I want to always read the data in the insertd presorted order with to_batches?
  • Any other PyArrwo tips dealing with data and datasets that does not fit into RAM?

Solution

  • I recommend using the dataset api. It can filter on chain_id without loading, and write the partitions. Dataset reading does maintain order.

    My own partitioning utility is similar to what you need. It provides sorting, and could be further adapted or provide ideas.