Tags: python, amazon-web-services, amazon-s3, aws-data-wrangler

awswrangler.s3.read_parquet ignores partition_filter argument


The partition_filter argument to wr.s3.read_parquet() is failing to filter a partitioned parquet dataset on S3. Here's a reproducible example (you may need to pass a correctly configured boto3_session argument):

Dataset setup:

import pandas as pd
import awswrangler as wr
import boto3

s3_path = "s3://bucket-name/folder"

df = pd.DataFrame({"val": [1,3,2,5], "date": ['2021-04-01','2021-04-01','2021-04-02','2021-04-03']})

wr.s3.to_parquet(
    df = df,
    path = s3_path,
    dataset = True,
    partition_cols = ['date']
)
#> {'paths': ['s3://bucket-name/folder/date=2021-04-01/38399541e6fe4fa7866181479dd28e8e.snappy.parquet',
#>   's3://bucket-name/folder/date=2021-04-02/0a556212b5f941c7aa3c3775d2387419.snappy.parquet',
#>   's3://bucket-name/folder/date=2021-04-03/cb71397bea104787a50a90b078d564bd.snappy.parquet'],
#>  'partitions_values': {'s3://bucket-name/folder/date=2021-04-01/': ['2021-04-01'],
#>   's3://bucket-name/folder/date=2021-04-02/': ['2021-04-02'],
#>   's3://bucket-name/folder/date=2021-04-03/': ['2021-04-03']}}

S3 data is then viewable in console:


But reading it back with a date filter still returns all 4 records:

wr.s3.read_parquet(
    path=s3_path,
    partition_filter=lambda x: x["date"] >= "2021-04-02",
)
#>      val
#> 0    1
#> 1    3
#> 2    2
#> 3    5

In fact, substituting lambda x: False still returns all 4 rows. What am I missing? This is from the guidance:

partition_filter (Optional[Callable[[Dict[str, str]], bool]]) – Callback Function filters to apply on PARTITION columns (PUSH-DOWN filter). This function MUST receive a single argument (Dict[str, str]) where keys are partitions names and values are partitions values. Partitions values will be always strings extracted from S3. This function MUST return a bool, True to read the partition or False to ignore it. Ignored if dataset=False. E.g lambda x: True if x["year"] == "2020" and x["month"] == "1" else False
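For reference, the callback itself behaves exactly as that quote describes: for each partition prefix it receives a dict mapping partition column names to their string values, and comparisons are therefore string comparisons (which works for zero-padded ISO dates, since they sort lexicographically). A minimal local sketch of the evaluation awswrangler performs per partition, using the same partition values as the partitions_values output above:

```python
# Each partition prefix yields a {column_name: string_value} dict,
# mirroring the partitions_values returned by to_parquet above.
partitions = [
    {"date": "2021-04-01"},
    {"date": "2021-04-02"},
    {"date": "2021-04-03"},
]

date_filter = lambda x: x["date"] >= "2021-04-02"

# Lexicographic string comparison is safe here because ISO dates
# ("YYYY-MM-DD") sort in the same order as the dates themselves.
kept = [p for p in partitions if date_filter(p)]
print(kept)  # [{'date': '2021-04-02'}, {'date': '2021-04-03'}]
```

So the lambda itself is fine; the problem lies in how read_parquet is being called.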

I also note that the dataframes coming back do not include the partition 'date' column that was in the uploaded data; I can find no reference to this removal in the docs, and it's unclear whether it's relevant.


Solution

  • From the documentation: Ignored if dataset=False. Adding dataset=True as an argument to your read_parquet call will do the trick. It also explains the missing 'date' column: partition values are parsed back out of the S3 key paths only in dataset mode, so with dataset=True the partition column is restored as well.
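With that fix, the original read call becomes the following (same placeholder bucket path as above; this needs live S3 access and credentials, so it is shown here as an untested sketch):

```python
import awswrangler as wr

s3_path = "s3://bucket-name/folder"  # placeholder path from the question

df = wr.s3.read_parquet(
    path=s3_path,
    dataset=True,  # required: partition_filter is ignored when dataset=False
    partition_filter=lambda x: x["date"] >= "2021-04-02",
)
# Expect only the rows from the date=2021-04-02 and date=2021-04-03
# partitions, with the 'date' partition column present in the result.
```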