python parquet partitioning pyarrow

Extract partition list from pyarrow.dataset.ParquetFileFragment object


I have a pyarrow.dataset.ParquetFileFragment object like this:

<pyarrow.dataset.ParquetFileFragment path=pq-test/Location=US-California/Industry=HT-SoftWare/dce9900c46f94ec3a8dca094cf62bd34-0.parquet partition=[Industry=HT-SoftWare, Location=US-California]>

I can get the path using .path, but .partition does not give me the partition list. Is there any way to get it?


Solution

  • There is an open PR that would expose ds.get_partition_keys publicly: https://github.com/apache/arrow/pull/33862/files. That would let you get a dict of partition keys from the partition_expression attribute of a ds.ParquetFileFragment.

    Note that you have to pass the partitioning parameter when you read the dataset in order to get a valid expression:

    >>> import pyarrow as pa
    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
    ...                              "Brittle stars", "Centipede"]})
    >>> import pyarrow.dataset as ds
    >>> ds.write_dataset(table, "dataset_name_fragments", format="parquet",
    ...                  partitioning=["year"], partitioning_flavor="hive")
    >>> dataset = ds.dataset('dataset_name_fragments/', format="parquet", partitioning="hive")
    >>> fragments = dataset.get_fragments()
    >>> fragment = next(fragments)
    >>> fragment.partition_expression
    <pyarrow.compute.Expression (year == 2019)>
    

    It would also be great to have an attribute that returns the partition list directly; that will be added to the mentioned PR.
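
Until such a helper is available in your pyarrow version, one possible workaround is to read the partition keys straight from the fragment's path. The sketch below is not a pyarrow API: partition_keys_from_path is a hypothetical helper name, it assumes hive-style key=value directories like the ones in the question, and it returns every value as a string (no type inference). It URL-decodes the values on the assumption that the writer may have percent-encoded special characters.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote


def partition_keys_from_path(path):
    """Hypothetical helper: map each hive-style ``key=value`` directory to a dict entry."""
    keys = {}
    # Skip the last component: it is the file name, not a partition directory.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        # Directories without "=" (e.g. the dataset root) are not partitions.
        if sep and key:
            # Assumption: values may be percent-encoded, so decode them.
            keys[key] = unquote(value)
    return keys


# Path taken from the fragment repr in the question (i.e. fragment.path).
path = ("pq-test/Location=US-California/Industry=HT-SoftWare/"
        "dce9900c46f94ec3a8dca094cf62bd34-0.parquet")
print(partition_keys_from_path(path))
```

With a real fragment you would call partition_keys_from_path(fragment.path). The keys come back in directory order, so list(result.items()) gives a partition list. If your pyarrow release includes the helper from the PR above, passing fragment.partition_expression to it is preferable, since it does not depend on the directory layout.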