Search code examples
mysqlquery-optimizationgreatest-n-per-groupmysql-8.0union-all

CTE built from UNION running very slow


I've put together a simplified version of a query I'm struggling with:

WITH cte AS (
    SELECT 
        num_serves_p1 AS num_serves, player_id_p1 AS player_id, location, date_
    FROM
        table1 
    UNION ALL SELECT 
        num_serves_p2 AS num_serves, player_id_p2 AS player_id, location, date_
    FROM
        table1
)
SELECT 
    id_, SUM(cte.num_serves)
FROM
    table1 AS t
        JOIN
    cte ON cte.player_id = t.player_id_p1
        AND cte.location = t.location
        AND cte.date_ < t.date_
WHERE 
    t.id_ = 12345

This currently takes over a minute to run for about a million rows.

If I chop out the second part of the UNION_ALL to give me:

WITH cte AS (
    SELECT 
        num_serves_p1 AS num_serves, player_id_p1 AS player_id, location, date_
    FROM
        table1 
)
SELECT 
    id_, SUM(cte.num_serves)
FROM
    table1 AS t
        JOIN
    cte ON cte.player_id = t.player_id_p1
        AND cte.location = t.location
        AND cte.date_ < t.date_
WHERE 
    t.id_ = 12345

Or the first part of the UNION_ALL to give me:

WITH cte AS (
    SELECT 
        num_serves_p2 AS num_serves, player_id_p2 AS player_id, location, date_
    FROM
        table1 
)
SELECT 
    id_, SUM(cte.num_serves)
FROM
    table1 AS t
        JOIN
    cte ON cte.player_id = t.player_id_p1
        AND cte.location = t.location
        AND cte.date_ < t.date_
WHERE 
    t.id_ = 12345

Then either of these queries run in under 0.1 seconds.

So I figure all my indexes are good but the issue lies in using UNION_ALL. Given the drastic difference in performance then my hunch is that in the first query the CTE loads all the rows in table1 (x2) whereas in the second and third queries the CTE only loads the rows required by the joins?

I realise that I could create two CTEs and simply add the results together however at some point I'm going to calculate fields that rely on a date ordered UNION_ALL CTE.

Any ideas on why I'm seeing the poor performance with the first query and what I might be able to do about it?

Quick final note: I haven't included all the usual table descriptions etc because this is really a pseudo query. I figured there'd be enough to go on as I've isolated the usage of UNION_ALL as being the likely issue. I'm also not sure how to generate sufficient data to illustrate the issue. Fingers crossed the above is enough.


Solution

  • The Optimizer is not smart enough with unions, joins, CTEs, to to propagate the "id=" all the way through. Instead, it is evaluating each part of the UNION, creating a temp table with all those rows, and possibly building an index to help with the next step.

    It may run even faster if you do the SUM for each table copy and then add them together. (I can't even tell if that is a valid transformation, and the Optimizer won't even try it.)

    More

    AND cte.date_ < t.date_
    WHERE 
        t.id_ = 12345
    

    -->

    AND cte.date < ( SELECT date_ FROM t WHERE id_ = 12345 )
    

    Or, this might be better:

    SELECT ...
        FROM ( SELECT date_ FROM t WHERE id_ = 12345 ) AS d
        JOIN ...  (( the other stuff ))
        WHERE cte.date_ < d.date_
    

    and more

    SELECT  
        (
            SELECT  SUM(num_serves)
                FROM  table1
                WHERE  player_id = this.p1_id 
        ) AS p1_prev_serves, 
        (
            SELECT  SUM(num_serves)
                FROM  table1
                WHERE  player_id = this.p2_id 
        ) AS p2_prev_serves
        FROM  table1 AS this
        WHERE  player_id IN (this.p1_id, this.p2_id)
          AND  this.game_id = 12345;
    SELECT  SUM(num_serves) AS prev_serves
        FROM  table1 AS all