I have a pyarrow.dataset.ParquetFileFragment object like this:
<pyarrow.dataset.ParquetFileFragment path=pq-test/Location=US-California/Industry=HT-SoftWare/dce9900c46f94ec3a8dca094cf62bd34-0.parquet partition=[Industry=HT-SoftWare, Location=US-California]>
I can get the path using .path, but .partition does not give me the partition list. Is there any way to get it?
There is an open PR that would expose ds.get_partition_keys publicly: https://github.com/apache/arrow/pull/33862/files. That would let you get a plain dict from the partition_expression attribute of a ds.ParquetFileFragment.
Note that you have to pass the partitioning parameter when you read the dataset; otherwise the fragments do not carry a valid partition expression:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
... 'n_legs': [2, 2, 4, 4, 5, 100],
... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
... "Brittle stars", "Centipede"]})
>>> import pyarrow.dataset as ds
>>> ds.write_dataset(table, "dataset_name_fragments", format="parquet",
... partitioning=["year"], partitioning_flavor="hive")
>>> dataset = ds.dataset('dataset_name_fragments/', format="parquet", partitioning="hive")
>>> fragments = dataset.get_fragments()
>>> fragment = next(fragments)
>>> fragment.partition_expression
<pyarrow.compute.Expression (year == 2019)>
It would also be great to have an attribute that returns the partition list directly; that will be added to the mentioned PR.