Tags: python, parquet, pyarrow, apache-arrow

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename


I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.

I can read them all and subsequently convert to a pandas dataframe:

import glob
import pyarrow.parquet as pq

# Collect all matching files and read them as one dataset
files = glob.glob("data-*.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()

This works just fine. What I would like to have is an additional column in the final data frame indicating which file each row originated from.

As far as I understand, the data in ds is partitioned, with one partition per file. So it would just be a matter of including the partition key in the data frame.

Is this feasible?


Solution

  • The partition key is, at the moment, included in the dataframe. However, all existing partitioning schemes use directory names for the key. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet, the key would be added as a column (you will need to supply a partitioning object when you read the dataset), as sketched below.

    There is no way today (in pyarrow) to get the filename in the returned results.
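    Here is a minimal sketch of the directory-based approach mentioned above, using pyarrow.dataset; the root path data_root and the key name batch are assumptions for illustration:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Layout like data_root/batch=0/part.parquet: "hive" infers the key
    # name and value from the directory names.
    dataset = ds.dataset("data_root", format="parquet", partitioning="hive")

    # Layout like data_root/0/part.parquet: supply an explicit partitioning
    # object so pyarrow knows what to call the key.
    # dataset = ds.dataset(
    #     "data_root",
    #     format="parquet",
    #     partitioning=ds.partitioning(pa.schema([("batch", pa.int32())])),
    # )

    df = dataset.to_table().to_pandas()  # df now includes a "batch" column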
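    If moving the files into partition directories is not an option, a possible workaround is to read each file separately and add the filename column yourself. This is a sketch, and the column name source_file is a made-up choice:

    import glob
    import pandas as pd
    import pyarrow.parquet as pq

    frames = []
    for path in glob.glob("data-*.parquet"):
        frame = pq.read_table(path).to_pandas()
        frame["source_file"] = path  # hypothetical column name
        frames.append(frame)
    df = pd.concat(frames, ignore_index=True)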