Search code examples
sqlamazon-web-servicesamazon-redshiftquery-performance

Performance of JOIN then UNION vs. UNION then JOIN


I have a SQL query along the following lines:

WITH a AS (
    SELECT *
    FROM table1
        INNER JOIN table3 ON table1.id = table3.id
    WHERE table1.condition = 'something'
),

b AS (
    SELECT *
    FROM table2
        INNER JOIN table3 ON table2.id = table3.id
    WHERE table2.condition = 'something else'
),

combined AS (
    SELECT *
    FROM a
    UNION
    SELECT *
    FROM b
)

SELECT *
FROM combined

I rewrote this as:

WITH a AS (
    SELECT *
    FROM table1
    WHERE table1.condition = 'something'
),

b AS (
    SELECT *
    FROM table2
    WHERE table2.condition = 'something else'
),

combined AS (
    SELECT *
    FROM (
        SELECT *
        FROM a
        UNION
        SELECT *
        FROM b
    ) union
    INNER JOIN table3 ON union.id = table3.id
)

SELECT *
FROM combined

I expected that this might be more performant, since it's only doing the JOIN once, or at the very least that it would have no effect on execution time. I was surprised to find that the query now takes almost twice as long to run.

This is no problem since it worked perfectly well before, I only really rewrote it out of my own personal style preference anyway so I'll stick with the original. But I'm no expert when it comes to databases/SQL, so I was interested to know if anyone can share any insights as to why this second approach is so much less performant?

If it makes a difference, it's a Redshift database, table1 and table2 are both around ~250 million rows, table3 is ~1 million rows, and combined has less than 1000 rows.


Solution

  • The SQL optimizer has more information on "bare" tables than on "computed" tables. So, it is easier to optimize the two CTEs.

    In a database that uses indexes, this might affect index usage. In Redshift, this might incur additional data movement.

    In this particular case, though, I suspect the issue might have to do with filtering via the JOIN operation. The UNION is incurring overhead to remove duplicates. By filtering before the UNION, duplicate removal is faster than filtering afterwards.

    In addition, the UNION may affect where the data is located, so the second version might require additional data movement.