Tags: python, parquet, pyarrow, apache-arrow

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename


I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.

I can read them all and subsequently convert to a pandas dataframe:

import glob
import pyarrow.parquet as pq

# Collect all matching files and read them as one dataset
files = glob.glob("data-*.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()

This works just fine. What I would like to have is an additional column in the final data frame indicating which file each row originated from.

As far as I understand, the data in ds is partitioned, with one partition per file. So it would just be a matter of including the partition key in the data frame.

Is this feasible?


Solution

  • The partition key is, at the moment, included in the dataframe. However, all existing partitioning schemes use directory names for the key. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet, the key would be added as a column (you will need to supply a partitioning object when you read the dataset), as sketched below.

    There is no way today (in pyarrow) to get the filename in the returned results.
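    Here is a minimal sketch of the directory-based approach mentioned above, using pyarrow.dataset; the root path data_root and the key name batch are assumptions for illustration:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Layout like data_root/batch=0/part.parquet: "hive" infers the key
    # name and value from the directory names.
    dataset = ds.dataset("data_root", format="parquet", partitioning="hive")

    # Layout like data_root/0/part.parquet: supply an explicit partitioning
    # object so pyarrow knows what to call the key.
    # dataset = ds.dataset(
    #     "data_root",
    #     format="parquet",
    #     partitioning=ds.partitioning(pa.schema([("batch", pa.int32())])),
    # )

    df = dataset.to_table().to_pandas()  # df now includes a "batch" column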
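    If moving the files into partition directories is not an option, a possible workaround is to read each file separately and add the filename column yourself. This is a sketch, and the column name source_file is a made-up choice:

    import glob
    import pandas as pd
    import pyarrow.parquet as pq

    frames = []
    for path in glob.glob("data-*.parquet"):
        frame = pq.read_table(path).to_pandas()
        frame["source_file"] = path  # hypothetical column name
        frames.append(frame)
    df = pd.concat(frames, ignore_index=True)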