I have a PyArrow Parquet file that is too large to process in memory. Because the data can be easily split into shards, I'd like to partition it manually and create a PyArrow dataset out of the file. While partitioning, the rows within each partition need to be re-sorted, so that the data can be iterated in a natural order.
Here is a schema snippet if it is relevant. The partitioning key is chain_id:
import pyarrow as pa

pa.schema([
("chain_id", pa.uint32()),
("pair_id", pa.uint64()),
("block_number", pa.uint32()),
("timestamp", pa.timestamp("s")),
("tx_hash", pa.binary(32)),
("log_index", pa.uint32()),
])
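For the within-partition re-sort, I had roughly this in mind; the sort keys (block_number, log_index) and the toy data are just an example of what a natural order could be:

import pyarrow as pa

# toy stand-in for the rows of one chain_id partition
table = pa.table({
    "chain_id": pa.array([1, 1, 1], pa.uint32()),
    "block_number": pa.array([30, 10, 20], pa.uint32()),
    "log_index": pa.array([0, 5, 2], pa.uint32()),
})

# sort the partition before it is written out
table = table.sort_by([("block_number", "ascending"), ("log_index", "ascending")])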
The process I was planning to use is to split the file into one table per partitioning key (chain_id in the above schema), sort each table, and then write the tables out as a partitioned dataset. However, the documentation on FileSystemDataset in PyArrow is thin. My questions are along these lines:

- How do I write a table out as one partition of a FileSystemDataset, assuming the table is the whole content of that partition itself?
- What guarantees does pyarrow.dataset.FileSystemDataset give if I want to always read the data back in the inserted, presorted order with to_batches?
I recommend using the dataset API. It can filter on chain_id without loading the whole file into memory, and it can write out the partitions. Dataset reading does maintain order.
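Roughly something like this; the file and directory names are placeholders, the sort keys are just an example, and it assumes any single partition fits in memory:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Open the large file as a dataset; this only reads metadata, not the rows.
source = ds.dataset("data.parquet", format="parquet")

# Collect the distinct chain_ids by scanning just that one column.
chain_ids = pc.unique(source.to_table(columns=["chain_id"])["chain_id"])

# Hive-style directory layout: partitioned/chain_id=<value>/...
partitioning = ds.partitioning(pa.schema([("chain_id", pa.uint32())]), flavor="hive")

for cid in chain_ids:
    # Load one partition's rows at a time.
    part = source.to_table(filter=ds.field("chain_id") == cid.as_py())
    # Re-sort within the partition before writing it out.
    part = part.sort_by([("block_number", "ascending"), ("log_index", "ascending")])
    ds.write_dataset(
        part,
        "partitioned",
        format="parquet",
        partitioning=partitioning,
        existing_data_behavior="overwrite_or_ignore",
    )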
My own partitioning utility does something similar to what you need: it handles sorting and could be adapted further, or at least serve as a starting point.
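For reading back in the stored order, a sketch along these lines should work (same placeholder directory as above; use_threads=False is just a conservative choice to keep batch order deterministic):

import pyarrow.dataset as ds

dataset = ds.dataset("partitioned", format="parquet", partitioning="hive")

# Only the files under chain_id=1 are opened; within each file the rows
# come back in the order they were written.
for batch in dataset.to_batches(filter=ds.field("chain_id") == 1, use_threads=False):
    ...  # consume the presorted rows batch by batch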