Say I have a query:
SELECT SUM(Data1), SUM(Data2), UserID
FROM Table1
WHERE Data1 IN (1, 2, 3)
AND USERID IN (SELECT USERID FROM smallish) -- smallish: some small table of ~10000 values
GROUP BY UserID
How will this query perform? Would it be better to wrap the aggregation in a CTE and filter on USERID after the GROUP BY? Is the IN evaluated for every row scanned? If so, wouldn't it be faster to group first?
edit: Added missing aggregation as mentioned by User @jarlh
I would be inclined to change the query to use EXISTS, although I'm not sure this makes a difference in DB2 (some databases have better optimizers than others):
SELECT SUM(Data1), SUM(Data2), UserID
FROM Table1 t1
WHERE Data1 IN (1, 2, 3) AND
EXISTS (SELECT 1
FROM smallish s
WHERE s.USERID = t1.USERID
)
GROUP BY UserID;
Then, if I understand correctly, your question is whether this query might perform better:
SELECT SUM(Data1), SUM(Data2), UserID
FROM Table1 t1
WHERE Data1 IN (1, 2, 3)
GROUP BY UserID
HAVING EXISTS (SELECT 1
FROM smallish s
WHERE s.USERID = t1.USERID
);
First, you can test the two versions on your data and your system. That is always the best way to validate performance questions.
Second, I would expect the WHERE version to be faster, and probably much faster, because the aggregation has many fewer rows to aggregate.
One exception is if the database can use an index for the aggregation. I would expect the IN to preclude that possibility. Another exception is if the EXISTS/IN clause is really, really expensive (say no indexes are involved); in that case, running it only once per result row might be faster. That said, this is balanced against the larger amount of data in the aggregation.
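To try the comparison on a small scale, here is a minimal sketch that runs both versions and checks that they return the same groups. It makes several assumptions: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for DB2, the table sizes and value ranges are invented, and an ORDER BY is added only so the two result sets can be compared directly. Timings from SQLite say nothing about DB2's optimizer, so treat this as a template for your own test rather than as evidence.

```python
import random
import sqlite3
import time

# Assumption: SQLite in memory stands in for DB2; all sizes are made up.
random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (UserID INTEGER, Data1 INTEGER, Data2 INTEGER)")
conn.execute("CREATE TABLE smallish (UserID INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(random.randrange(5000), random.randrange(1, 7), random.randrange(100))
     for _ in range(50_000)],
)
# Every 10th user id goes into the small lookup table.
conn.executemany("INSERT INTO smallish VALUES (?)",
                 [(u,) for u in range(0, 5000, 10)])

# Filter before aggregating.
WHERE_VERSION = """
SELECT SUM(Data1), SUM(Data2), UserID
FROM Table1 t1
WHERE Data1 IN (1, 2, 3)
  AND EXISTS (SELECT 1 FROM smallish s WHERE s.UserID = t1.UserID)
GROUP BY UserID
ORDER BY UserID
"""

# Aggregate everything, then filter the groups.
HAVING_VERSION = """
SELECT SUM(Data1), SUM(Data2), UserID
FROM Table1 t1
WHERE Data1 IN (1, 2, 3)
GROUP BY UserID
HAVING EXISTS (SELECT 1 FROM smallish s WHERE s.UserID = t1.UserID)
ORDER BY UserID
"""


def run(sql):
    """Execute a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


where_rows, where_secs = run(WHERE_VERSION)
having_rows, having_secs = run(HAVING_VERSION)

# How many rows feed the GROUP BY in each version.
rows_into_where_agg = conn.execute("""
    SELECT COUNT(*) FROM Table1 t1
    WHERE Data1 IN (1, 2, 3)
      AND EXISTS (SELECT 1 FROM smallish s WHERE s.UserID = t1.UserID)
""").fetchone()[0]
rows_into_having_agg = conn.execute(
    "SELECT COUNT(*) FROM Table1 WHERE Data1 IN (1, 2, 3)"
).fetchone()[0]

print("same result:", where_rows == having_rows)
print("rows aggregated (WHERE version): ", rows_into_where_agg)
print("rows aggregated (HAVING version):", rows_into_having_agg)
print(f"WHERE version:  {where_secs:.4f}s")
print(f"HAVING version: {having_secs:.4f}s")
```

The row counts make the argument above concrete: the WHERE version aggregates only the rows whose user appears in the small table, while the HAVING version aggregates every row that passes the Data1 filter and discards most groups afterwards. On DB2 itself, comparing the access plans of the two statements would be the more reliable check.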