I save a pandas dataframe as follows:
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(my_df)
pq.write_to_dataset(table, root_path="data/bfl", partition_cols=['pnr_group'])
I can find it stored in a partitioned directory structure like this:
data/bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet
When I read an individual parquet file directly using pq.read_table(), I can see the data. However, when trying to read it using PyArrow's Dataset API with filtering, I get empty results:
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# This works - has data
file_path = 'data/bfl/pnr_group=0/319a1fb5557a342c1b55356ce5123123-0.parquet'
table = pq.read_table(file_path)
print(len(table))  # Shows rows
# This finds the correct files but returns empty data
dataset = ds.dataset(
    'data/bfl',
    format='parquet',
    partitioning=ds.DirectoryPartitioning.discover(['pnr_group'])
)
filter_expr = ds.field('pnr_group') == '0'
filtered_dataset = dataset.filter(filter_expr)
df = filtered_dataset.to_table().to_pandas() # Returns empty dataframe
The dataset schema shows 'pnr_group' as a string type, and dataset.files correctly lists all the parquet files. However, after filtering and converting to pandas, the resulting dataframe is empty. How can I correctly read and filter partitioned parquet files using PyArrow's Dataset API?
DirectoryPartitioning assumes the structure to be data/bfl/0/xxx.parquet. Since write_to_dataset produced paths like data/bfl/pnr_group=0/..., I think you want to use HivePartitioning instead:
dataset = ds.dataset(
    'data/bfl',
    format='parquet',
    # Hive-style paths encode the field names themselves,
    # so discover() takes no list of field names here
    partitioning=ds.HivePartitioning.discover()
)
dataset.filter(ds.field('pnr_group') == 0).to_table()
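(If your pyarrow version doesn't have Dataset.filter(), passing the expression directly as dataset.to_table(filter=ds.field('pnr_group') == 0) is equivalent.)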
PS: You may have to specify explicitly that pnr_group is a string; discover() will infer an integer by default, because the partition values look numeric.
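A minimal sketch of doing that, assuming the same 'data/bfl' layout as above: build the partitioning from an explicit schema instead of calling discover().

import pyarrow as pa
import pyarrow.dataset as ds

# Declare pnr_group as a string up front instead of letting
# discover() infer an integer from directory values like "0"
partitioning = ds.HivePartitioning(pa.schema([('pnr_group', pa.string())]))

dataset = ds.dataset('data/bfl', format='parquet', partitioning=partitioning)

# The string comparison from the question now matches the partition values
table = dataset.to_table(filter=ds.field('pnr_group') == '0')
print(len(table))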