I have a single parquet file in S3 (not a partitioned one).
I need to run a query like "select * from read_parquet('s3://....') where colA=1 and colB=2" from an ECS container where I've installed DuckDB.
I know that predicate pushdown happens for parquet files on a local filesystem, but I'm unsure about S3.
For this to work, DuckDB has to figure out the byte range of the metadata section (the footer at the end of the parquet file), read it, and then selectively read only the row groups that can match the filters. So I'm not sure if it works this way over S3.
I've seen a few answers indicating that it's possible to query partitioned parquet using DuckDB, but again I'm not sure whether that applies when querying from S3.
Any documentation or pointers to the code would be really useful!
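For concreteness, here's a sketch of what I'm effectively running from the DuckDB shell in the container (the bucket path, region, and credentials below are placeholders):

```sql
-- httpfs is the DuckDB extension that provides S3 (and HTTP) filesystems.
INSTALL httpfs;
LOAD httpfs;

-- Placeholder region/credentials; in practice these come from the ECS task.
SET s3_region = 'us-east-1';
SET s3_access_key_id = '<access-key>';
SET s3_secret_access_key = '<secret-key>';

SELECT *
FROM read_parquet('s3://my-bucket/my-file.parquet')
WHERE colA = 1 AND colB = 2;
```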
Thanks
DuckDB can perform predicate pushdown on any filesystem that supports range reads. On S3 (and also plain HTTP(S)), the HTTP Range header is used to first read the metadata and then download only the parts of the parquet file that are required for the query. Note that DuckDB also has a prefetching mechanism that keeps the total number of requests within a reasonable bound.
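One way to check this yourself (a sketch, reusing the placeholder path and httpfs setup from the question) is to wrap the query in EXPLAIN and look at where the filters end up in the plan:

```sql
-- If colA=1 and colB=2 appear as "Filters" inside the parquet scan
-- operator, they are pushed into the scan and combined with the
-- row-group statistics to skip row groups entirely; if they instead
-- appear in a separate FILTER operator above the scan, they are only
-- applied after the data has been read.
EXPLAIN
SELECT *
FROM read_parquet('s3://my-bucket/my-file.parquet')
WHERE colA = 1 AND colB = 2;
```

Depending on your DuckDB version, running the same query under EXPLAIN ANALYZE also prints HTTP request statistics from httpfs, so you can confirm that only a handful of ranged GET requests are issued rather than a full-file download.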
A few pointers to some relevant code: