
PyArrow Dataset filtering not working with partitioned parquet files


I save a pandas dataframe as follows:

import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(my_df)
pq.write_to_dataset(table, root_path="data/bfl", partition_cols=['pnr_group'])

I can find it stored in a partitioned directory structure like this:

data/bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet

When I read an individual parquet file directly using pq.read_table(), I can see the data. However, when trying to read it using PyArrow's Dataset API with filtering, I get empty results:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# This works - has data
file_path = 'data/bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet'
table = pq.read_table(file_path)
print(len(table))  # Shows rows

# This finds the correct files but returns empty data
dataset = ds.dataset(
    'data/bfl',
    format='parquet',
    partitioning=ds.DirectoryPartitioning.discover(['pnr_group'])
)

filter_expr = ds.field('pnr_group') == '0'
filtered_dataset = dataset.filter(filter_expr)
df = filtered_dataset.to_table().to_pandas()  # Returns empty dataframe

The dataset schema shows 'pnr_group' as a string type, and dataset.files correctly lists all the parquet files. However, after filtering and converting to pandas, the resulting dataframe is empty. How can I correctly read and filter partitioned parquet files using PyArrow's Dataset API?


Solution

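  • DirectoryPartitioning assumes the structure to be data/bfl/0/xxx.parquet, where each directory name is the bare value. With your layout it reads the whole directory name 'pnr_group=0' as the value of pnr_group, so comparing against '0' matches nothing.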

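    I think you want to use HivePartitioning, which parses key=value directory names such as pnr_group=0 (the layout that write_to_dataset produces).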

    dataset = ds.dataset(
        'data/bfl',
        format='parquet',
        # Hive discovery takes the field names from the key=value
        # directory names, so no field list is passed here.
        partitioning=ds.HivePartitioning.discover()
    )
    dataset.filter(ds.field('pnr_group') == 0).to_table()
    

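    PS: Discovery infers the type from the directory names, so pnr_group is read as an integer by default and has to be compared with 0, not '0'. If you need it as a string, specify the type explicitly, for example by passing a schema to ds.partitioning(..., flavor='hive').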