Tags: amazon-s3, hive, parquet, duckdb

Not seeing file-level predicate pushdown when querying a hive-partitioned table in S3


I am using DuckDB via DuckDB-WASM. I am creating a view on top of a hive-partitioned table in S3 with SQL like:

create or replace view my_view as
select
    Part1 as part_1
  , Part2 as part_2
  , Column1 as column_1
  , Column2 as column_2
  from read_parquet(
    [
      's3://my-bucket/path/to/part1=abc/part2=123/000.parquet',
      's3://my-bucket/path/to/part1=def/part2=456/000.parquet',
      's3://my-bucket/path/to/part1=ghi/part2=789/000.parquet'
    ],
    hive_partitioning=1)

I then execute a query like:

select count(*) from my_view where part_1 = 'abc' and part_2 = '123'

I expected DuckDB to push the predicate down and read only the s3://my-bucket/path/to/part1=abc/part2=123/000.parquet file. Instead, the Chrome DevTools network tab shows all three files being fetched.
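
One way to see where the filter lands is to inspect the query plan. This is a minimal check, and the plan layout described here is an assumption about DuckDB's EXPLAIN output: when pushdown works, the partition filters typically appear on the PARQUET_SCAN node itself, whereas a separate FILTER operator sitting above the scan suggests the predicate was not pushed down.

explain
select count(*) from my_view where part_1 = 'abc' and part_2 = '123'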


Solution

  • I believe I figured it out. S3 prefixes are case-sensitive, so the hive partition columns are named exactly as they appear in the path: part1 and part2, not Part1 and Part2. Referencing the capitalized names in the view appears to defeat the file-level pruning. Changing the view to this seems to have fixed it for me.

    create or replace view my_view as
    select
        part1 as part_1
      , part2 as part_2
      , Column1 as column_1
      , Column2 as column_2
      from read_parquet(
        [
          's3://my-bucket/path/to/part1=abc/part2=123/000.parquet',
          's3://my-bucket/path/to/part1=def/part2=456/000.parquet',
          's3://my-bucket/path/to/part1=ghi/part2=789/000.parquet'
        ],
        hive_partitioning=1)
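
    A possible simplification (a sketch, not tested here, assuming the bucket layout is uniform and the httpfs extension is loaded) is to let a glob pattern enumerate the files instead of listing them one by one; the partition column names still come from the path, so they stay lowercase:

    create or replace view my_view as
    select
        part1 as part_1
      , part2 as part_2
      , Column1 as column_1
      , Column2 as column_2
      from read_parquet(
        's3://my-bucket/path/to/*/*/*.parquet',
        hive_partitioning=1)

    Either way, predicates on part_1 and part_2 should now prune the scan down to the single matching file.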