Search code examples
sqletlquery-performanceimpala

Should I put a row number filter in join condition or in a prior CTE?


I have a subscription table and a payments table that I need to join. I am trying to decide between 2 options and performance is a key consideration.

Which of the two OPTIONS below will perform better?

I am using Impala, and these tables are large (multiple millions of rows) I am needing to only get one row for every id and date grouping (hence the row_number() analytic function).

I have shortened the queries to illustrate my question:

OPTION 1:

WITH cte
   AS (
   SELECT *
      , SUM(amount) OVER (PARTITION BY id, date) 
        AS sameday_total
      , ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC)
        AS sameday_rownum
   FROM payments
), 
payment
AS (
    SELECT * 
    FROM cte
    WHERE sameday_rownum = 1
    )
    SELECT s.* 
       , p.sameday_total
    FROM subscription
    INNER JOIN payment ON s.id = p.id

OPTION 2:

WITH payment
   AS (
   SELECT *
          , SUM(payment_amount) OVER (PARTITION BY id, date) 
            AS sameday_total
          , ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC)
            AS sameday_rownum
   FROM payments
)
SELECT s.*
       , p.sameday_total
FROM subscription
INNER JOIN payment ON s.id = p.id
                  AND p.sameday_rownum = 1

Solution

  • An "Option 0" also exists. A far more traditional "derived table" which simply does not require use of any CTE.

    SELECT s.*
           , p.sameday_total
    FROM subscription
    INNER JOIN (
               SELECT *
                 , SUM(payment_amount) OVER (PARTITION BY id, date) 
                     AS sameday_total
                 , ROW_NUMBER() OVER (PARTITION BY id, date ORDER BY purchase_number DESC)
                    AS sameday_rownum
               FROM payments
               ) p ON s.id = p.id
                      AND p.sameday_rownum = 1
    

    All options 0,1 and 2 are likely to produce identical or very similar explain plans (although I'm more confident about that statement for SQL Server than Impala).

    Adopting a CTE does - in itself - not make a query more efficient or better performing, so the syntax alteration between option 1 and 2 isn't major. I prefer option 0 myself as I prefer to use CTEs for specific tasks (e.g. recursion).

    What you should do is use explain plans to study what each option produces.