Tags: amazon-s3, parquet, duckdb

Predicate Pushdown in DuckDB for a Parquet file in S3


I have a single parquet file in S3 (not a partitioned one).

I need to run a query like this from an ECS container (where I've installed DuckDB): "select * from read_parquet('s3://....') where colA=1 and colB=2"

  1. Will DuckDB read the entire parquet file into memory and then apply the filters on colA and colB, or will it be able to selectively read records from the parquet file by leveraging the parquet metadata?

I know that predicate pushdown happens for parquet files on a local filesystem, but I'm unsure about S3.

To do this, DuckDB would have to figure out the byte range of the metadata section in the parquet file and then selectively read row groups as well, so I'm not sure whether it works this way.

  2. If I have a partitioned parquet folder, will DuckDB be able to automatically select the data from the right partition based on the predicates in the query?

I've seen a few answers suggesting that it is possible to query partitioned parquet with DuckDB, but again I'm not sure whether that applies when querying from S3. A sketch of what I mean follows.
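For illustration, here's roughly the setup I have in mind; the bucket name, folder layout, and region are hypothetical placeholders:

    INSTALL httpfs;   -- extension that provides s3:// access
    LOAD httpfs;
    SET s3_region = 'us-east-1';   -- placeholder

    -- hypothetical hive-style layout:
    --   s3://my-bucket/events/colA=1/colB=2/part-0.parquet
    SELECT *
    FROM read_parquet('s3://my-bucket/events/*/*/*.parquet', hive_partitioning = true)
    WHERE colA = 1 AND colB = 2;

With hive_partitioning = true, the partition columns are derived from the folder names, so in principle the filters could prune whole partitions before any file is read; whether that pruning works the same over S3 is exactly my question.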

Any documentation or pointers to the code would be really useful!

Thanks


Solution

  • DuckDB can perform predicate pushdown on all filesystems that support range reads. On S3 (and also regular http(s)), the HTTP Range header is used to first read the metadata and then download only the parts of the parquet file that are required for the query. Note that DuckDB contains a prefetching mechanism to ensure the total number of requests stays reasonable. (A minimal end-to-end example is sketched after the pointers below.)

    A few pointers to some relevant code:
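    Separately from those code pointers, here is a minimal sketch of what this looks like in practice. The bucket path and credentials are placeholders, and the exact EXPLAIN output shape varies by DuckDB version:

        INSTALL httpfs;   -- enables s3:// (and http(s)) range reads
        LOAD httpfs;
        SET s3_region = 'us-east-1';          -- placeholder
        SET s3_access_key_id = 'AKIA...';     -- placeholder
        SET s3_secret_access_key = '...';     -- placeholder

        -- Only the parquet footer/metadata plus the row groups whose
        -- statistics can possibly match colA=1 AND colB=2 are fetched
        -- via ranged GET requests:
        SELECT * FROM read_parquet('s3://my-bucket/data.parquet')
        WHERE colA = 1 AND colB = 2;

        -- EXPLAIN shows where the filters end up:
        EXPLAIN SELECT * FROM read_parquet('s3://my-bucket/data.parquet')
        WHERE colA = 1 AND colB = 2;

    If the filters are listed under the parquet scan node in the EXPLAIN output rather than in a separate filter node above it, they should be getting pushed into the scan.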