Could anyone please explain with examples how exactly predicate pushdown works?
Say you want to execute a query
SELECT
SUM(price)
FROM sales
WHERE
purchase_date BETWEEN '2018-01-01' and '2018-01-31';
A very trivial implementation of a query engine is to iterate over all parquet/orc files, deserialize the price
and purchase_date
columns, apply the predicate on the purchase_date
and sum the filtered rows.
Parquet (not sure about orc) maintains statistics on the columns in each file, so if the execution engine is smart enough, it can look at the min/max of the purchase_date
within the statistics and determine if any rows is going to match. For example, if purchase_date.min=2014-05-05
and purchase_date.max=2014-05-06
, it can deduce that the predicate will always evaluate to false.
In other words, it can skip parquet files by combining statistics and the filter predicate . This can lead to massive gain of performance because IO (file or memory) is usually the bottleneck. The gain is inversely proportional to the selectivity (the percentage of matching rows).
The term predicate push-down comes from the fact that you're "hinting" the scan operator with the predicate that is then going to be used to filter the rows of interest. Or, pushing the predicate to the scan.