Extract partition list from pyarrow.dataset.ParquetFileFragment object

I have a pyarrow.dataset.ParquetFileFragment object like this:

<pyarrow.dataset.ParquetFileFragment path=pq-test/Location=US-California/Industry=HT-SoftWare/dce9900c46f94ec3a8dca094cf62bd34-0.parquet partition=[Industry=HT-SoftWare, Location=US-California]>

I can get the path using .path, but .partition does not give me the partition list. Is there any way to grab it?

Solution

There is an open PR (https://github.com/apache/arrow/pull/33862/files) that would expose ds.get_partition_keys publicly, which would let you get a dict of partition keys from the partition_expression attribute of a ds.ParquetFileFragment.

Note that you have to pass the partitioning parameter when you read the dataset; otherwise the fragment will not carry a valid partition expression:

>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.dataset as ds
>>> ds.write_dataset(table, "dataset_name_fragments", format="parquet",
...                  partitioning=["year"], partitioning_flavor="hive")
>>> dataset = ds.dataset('dataset_name_fragments/', format="parquet", partitioning="hive")
>>> fragments = dataset.get_fragments()
>>> fragment = next(fragments)
>>> fragment.partition_expression
<pyarrow.compute.Expression (year == 2019)>

It would also be great to have an attribute that returns the partition list directly; that will be added to the PR mentioned above.
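
In the meantime, one possible workaround is to parse the partition keys out of the fragment's path yourself. The sketch below is not part of pyarrow: partition_keys_from_path is a hypothetical helper. It assumes hive-style key=value directory names, as in the path from the question, and returns every value as a string. Values are percent-decoded as a precaution, in case the writer escaped special characters.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote


def partition_keys_from_path(path):
    """Hypothetical helper: map each hive-style `key=value` directory to a dict entry."""
    keys = {}
    # Skip the last component: it is the file name, not a partition directory.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        # Directories without "=" (e.g. the dataset root) are not partitions.
        if sep and key:
            keys[key] = unquote(value)
    return keys


# Path copied from the fragment shown in the question.
path = ("pq-test/Location=US-California/Industry=HT-SoftWare/"
        "dce9900c46f94ec3a8dca094cf62bd34-0.parquet")
print(partition_keys_from_path(path))
# {'Location': 'US-California', 'Industry': 'HT-SoftWare'}
```

With a real fragment you would call partition_keys_from_path(fragment.path). Unlike partition_expression, this does not recover the original column types (a year partition comes back as the string "2019"), so cast the values if you need them typed.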