Search code examples
hadoopparquetorc

How does predicate pushdown work exactly?


Could anyone please explain with examples how exactly predicate pushdown works?


Solution

  • Say you want to execute a query

    SELECT 
      SUM(price) 
    FROM sales 
    WHERE 
      purchase_date BETWEEN '2018-01-01' and '2018-01-31';
    

    A very trivial implementation of a query engine is to iterate over all parquet/orc files, deserialize the price and purchase_date columns, apply the predicate on the purchase_date and sum the filtered rows.

    Parquet (not sure about orc) maintains statistics on the columns in each file, so if the execution engine is smart enough, it can look at the min/max of the purchase_date within the statistics and determine if any rows is going to match. For example, if purchase_date.min=2014-05-05 and purchase_date.max=2014-05-06, it can deduce that the predicate will always evaluate to false.

    In other words, it can skip parquet files by combining statistics and the filter predicate . This can lead to massive gain of performance because IO (file or memory) is usually the bottleneck. The gain is inversely proportional to the selectivity (the percentage of matching rows).

    The term predicate push-down comes from the fact that you're "hinting" the scan operator with the predicate that is then going to be used to filter the rows of interest. Or, pushing the predicate to the scan.