I have a SQL query along the following lines:
WITH a AS (
SELECT *
FROM table1
INNER JOIN table3 ON table1.id = table3.id
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
INNER JOIN table3 ON table2.id = table3.id
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM a
UNION
SELECT *
FROM b
)
SELECT *
FROM combined
I rewrote this as:
WITH a AS (
SELECT *
FROM table1
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM (
SELECT *
FROM a
UNION
SELECT *
FROM b
) u
INNER JOIN table3 ON u.id = table3.id
)
SELECT *
FROM combined
I expected that this might be more performant, since it is only doing the JOIN once, or at the very least that it would have no effect on execution time. I was surprised to find that the query now takes almost twice as long to run.
This is no real problem, since the original worked perfectly well and I only rewrote it out of personal style preference, so I'll stick with it. But I'm no expert when it comes to databases/SQL, so I'd be interested to know if anyone can share any insight into why the second approach is so much less performant.
If it makes a difference, it's a Redshift database; table1 and table2 are both around ~250 million rows, table3 is ~1 million rows, and combined has fewer than 1000 rows.
The SQL optimizer has more information (such as table statistics) about "bare" tables than about "computed" intermediate results, so it is easier for it to optimize the two CTEs that read table1 and table2 directly. In a database that uses indexes, this might affect index usage; in Redshift, it might incur additional data movement between nodes.
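For illustration only (the actual DDL is not in the question): Redshift can avoid redistributing rows for a join when both tables are distributed on the join column. A hypothetical sketch of such DDL, with invented column types:

```sql
-- Hypothetical DDL: distributing both tables on the join key lets
-- Redshift perform the join locally on each node (DS_DIST_NONE in
-- the EXPLAIN output) instead of moving rows between nodes.
CREATE TABLE table1 (
    id        BIGINT,
    condition VARCHAR(64)
) DISTKEY (id);

CREATE TABLE table3 (
    id BIGINT
) DISTKEY (id);
```

Whether this applies here depends on how the real tables are distributed, which the question doesn't say.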
In this particular case, though, I suspect the issue has to do with filtering via the JOIN operation. The UNION incurs overhead to remove duplicates, and because the inner join to table3 also discards non-matching rows, joining before the UNION (as in the original query) leaves the duplicate-removal step far fewer rows to process than joining afterwards.
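A related point worth checking: if rows from the two branches can never overlap (here the two WHERE clauses filter on different values) or duplicates are acceptable, UNION ALL skips the sort/deduplicate step entirely. A sketch of the original query with that one change, using the table names from the question:

```sql
-- Variant of the original query: UNION ALL avoids the implicit
-- duplicate-removal pass that plain UNION performs. Only safe if
-- the two branches cannot produce overlapping rows, or if
-- duplicates in the result are acceptable.
WITH a AS (
    SELECT *
    FROM table1
    INNER JOIN table3 ON table1.id = table3.id
    WHERE table1.condition = 'something'
),
b AS (
    SELECT *
    FROM table2
    INNER JOIN table3 ON table2.id = table3.id
    WHERE table2.condition = 'something else'
)
SELECT * FROM a
UNION ALL
SELECT * FROM b;
```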
In addition, the UNION may affect where the data is located, so the second version might require additional data movement.
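The way to confirm any of this is to compare the two query plans. In Redshift, EXPLAIN shows the join order and any data-movement steps (labels such as DS_DIST_NONE, DS_DIST_BOTH, or DS_BCAST_INNER); running it on each version and comparing the redistribution steps should show where the extra time goes. For example, against the original query:

```sql
-- Show the plan for the original query; look for DS_DIST_* /
-- DS_BCAST_* labels on the join steps, which indicate rows being
-- redistributed or broadcast between nodes.
EXPLAIN
WITH a AS (
    SELECT *
    FROM table1
    INNER JOIN table3 ON table1.id = table3.id
    WHERE table1.condition = 'something'
),
b AS (
    SELECT *
    FROM table2
    INNER JOIN table3 ON table2.id = table3.id
    WHERE table2.condition = 'something else'
)
SELECT * FROM a
UNION
SELECT * FROM b;
```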