sql, db2, subquery, query-optimization

Filter before Group by or after with CTE


Say I have a query:

SELECT SUM(Data1), SUM(Data2), UserID 
FROM Table1
WHERE Data1 IN (1, 2, 3)
  AND UserID IN (SELECT UserID FROM smallish)  -- smallish: a small table of ~10,000 values
GROUP BY UserID

How does this query perform? Would it be better to wrap it in a CTE and filter on UserID after the GROUP BY? Does the IN get evaluated for every row scanned? If so, wouldn't it be faster to group first?

Edit: added the missing aggregation, as pointed out by @jarlh.


Solution

  • I would be inclined to change the query to use EXISTS -- although I'm not sure if this makes a difference in DB2 (some databases have better optimizers than others):

    SELECT SUM(Data1), SUM(Data2), UserID 
    FROM Table1 t1
    WHERE Data1 IN (1, 2, 3) AND
          EXISTS (SELECT 1
                  FROM smallish s
                  WHERE s.USERID = t1.USERID
                 )
    GROUP BY UserID;
    

    Then, if I understand correctly, your question is whether this query might perform better:

    SELECT SUM(Data1), SUM(Data2), UserID 
    FROM Table1 t1
    WHERE Data1 IN (1, 2, 3)          
    GROUP BY UserID
    HAVING EXISTS (SELECT 1
                  FROM smallish s
                  WHERE s.USERID = t1.USERID
                 );
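
    The CTE form described in the question is logically the same idea as filtering in HAVING: aggregate every user first, then discard the groups whose UserID is not in the small table. A minimal sketch, assuming the small table is called smallish as above; the CTE name agg and the column aliases SumData1 and SumData2 are made up for illustration:

    ```sql
    -- Aggregate first, then filter the grouped rows.
    -- "agg", "SumData1" and "SumData2" are hypothetical names.
    WITH agg AS (
        SELECT UserID,
               SUM(Data1) AS SumData1,
               SUM(Data2) AS SumData2
        FROM Table1
        WHERE Data1 IN (1, 2, 3)
        GROUP BY UserID
    )
    SELECT a.SumData1, a.SumData2, a.UserID
    FROM agg a
    WHERE EXISTS (SELECT 1
                  FROM smallish s
                  WHERE s.USERID = a.UserID
                 );
    ```

    Whether the optimizer pushes the outer filter down into the CTE, and so ends up with the same plan as the WHERE version, depends on the database and is worth checking in the plan.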
    

    First, you can test the two versions on your data and your system. That is always the best way to validate performance questions.
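
    Before comparing timings, it can help to confirm on a tiny data set that both forms return the same rows. The script below is only a sketch: the column types and the sample values are assumptions, not the real schema.

    ```sql
    -- Hypothetical sample data; column types are assumptions.
    CREATE TABLE Table1 (UserID INTEGER, Data1 INTEGER, Data2 INTEGER);
    CREATE TABLE smallish (USERID INTEGER);

    INSERT INTO Table1 (UserID, Data1, Data2) VALUES
        (1, 1, 10),
        (1, 2, 20),
        (2, 3, 30),
        (2, 9, 90),   -- dropped by Data1 IN (1, 2, 3)
        (3, 1, 40);   -- UserID 3 is not in smallish
    INSERT INTO smallish (USERID) VALUES (1), (2);

    -- Version 1: filter on the user list before aggregating.
    SELECT SUM(Data1), SUM(Data2), UserID
    FROM Table1 t1
    WHERE Data1 IN (1, 2, 3) AND
          EXISTS (SELECT 1 FROM smallish s WHERE s.USERID = t1.UserID)
    GROUP BY UserID;

    -- Version 2: aggregate all users, then filter the groups.
    SELECT SUM(Data1), SUM(Data2), UserID
    FROM Table1 t1
    WHERE Data1 IN (1, 2, 3)
    GROUP BY UserID
    HAVING EXISTS (SELECT 1 FROM smallish s WHERE s.USERID = t1.UserID);

    -- Both versions return one row each for UserID 1 and UserID 2,
    -- with SUM(Data1) = 3 and SUM(Data2) = 30 in each row.
    ```

    Once the results match, run both against realistic data volumes and compare elapsed times and access plans; a five-row table says nothing about performance.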

    Second, I would expect the WHERE version to be faster -- and much faster -- because it discards the non-matching rows first, so the aggregation has far fewer rows to process.

    One exception is if the database can use an index for the aggregation. I would expect the IN to preclude that possibility. Another exception is if the EXISTS/IN check is very expensive (say, no indexes are involved); in that case, running it only once per result row might be faster. That said, this is balanced against the larger amount of data in the aggregation.
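
    As a rough illustration of the indexes these exceptions refer to, here is a sketch with hypothetical index names. Whether the optimizer actually uses either index for these queries is not guaranteed and should be confirmed from the access plan.

    ```sql
    -- Hypothetical index names; adjust the columns to the real schema.

    -- Lets the EXISTS/IN check probe smallish by USERID instead of scanning it.
    CREATE INDEX ix_smallish_userid ON smallish (USERID);

    -- Leads with the grouping column and includes the filtered and summed
    -- columns, so the aggregation could in principle be answered from the index.
    CREATE INDEX ix_table1_user_data ON Table1 (UserID, Data1, Data2);
    ```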