Search code examples
sqlmariadbsubquerysql-execution-planrow-number

How to efficiently get the row indicies of a subquery or joined table using row_number()


I have a table of 'games' where some of those games (but not all) are grouped into 'campaigns' by a campaign ID.

I am trying to write an SQL query which will get a dataset containing various information about the games, but in particular: if a given game is part of a campaign how many games are in that campaign in total (I have this working) and the index of that game within the campaign (e.g. the earliest game in a campaign is index '1', the next is '2' and so on).

I have achieved this, but the execution plan looks terrible and the obvious way to fix that doesn't work, but I get ahead of myself.

Here is the working query, with some extraneous stuff removed:

        g1.`id` AS `game_id`,
        (SELECT
            COUNT(*)
            FROM `games` g3
            WHERE g3.`campaign` = g1.`campaign`
        ) AS `campaign_length`,
        ca2.`ri` AS `campaign_index`,
        ca1.`id` AS `campaign_id`, ca1.`name` AS `campaign_name`
    FROM `games` g1
    LEFT JOIN `campaigns` ca1 ON ca1.`id` = g1.`campaign`
    LEFT JOIN (
        SELECT 
            g4.`id` AS `id`,
            ROW_NUMBER() OVER (
                PARTITION BY g4.`campaign`
                ORDER BY g4.`start` ASC) AS `ri`
            FROM `games` g4
        ) AS ca2 ON ca2.`id` = g1.`id`
    WHERE g1.`end` > CURRENT_TIMESTAMP()
    AND g1.`gamemaster` = 25
    ORDER BY g1.`start` ASC
    ;

The problem with this version is that for table g4 the execution plan lists a full table scan - which is fine at the moment as there's only a few hundred records, but long term will be terrible for performance, especially as this query (or ones very similar to it) will be executed on many different pages of my website. I believe this is happening because the ROW_NUMBER() function needs to number all the rows before the LEFT JOIN's ON statement can filter them down to the ones I actually need.

The obvious solution, which I have tried to no avail, is to add WHERE g4.`campaign` = g1.`campaign` after FROM `games` g4; that way ROW_NUMBER() would only need to number those records that have a chance of being returned in the dataset. However this does not work because g1.`campaign` is not in scope.

I can do WHERE g4.`campaign` IS NOT NULL which at least gets the execution plan down to a Index Conditional instead of a full table scan, but it will still not scale nicely as the number of games in campaigns grows with time.

I know that my "obvious solution" won't work because of the scope problem, but does anyone have a suggestion for how I can achieve what I'm trying to do without a terrible execution plan?


Solution

  • Based on your comments, the campaign_index must be calculated before the WHERE clause is applied. This means that calculation of the campaign_index will always require a full table scan, as the WHERE clause can't reduce the rows being computed over.

    You can, however, use windowed functions rather than a self join and correlated sub-query...

    WITH
      games AS
    (
      SELECT
        *,
        COUNT(*)
          OVER (
            PARTITION BY `campaign`
          )
            AS `campaign_length`,
        ROW_NUMBER()
          OVER (
            PARTITION BY `campaign`
                ORDER BY `start`
          )
            AS `campaign_index`
        FROM
            games
    )
    SELECT
        games.*,
        campaigns.`name`   AS `campaign_name`
    FROM
      games
    LEFT JOIN
      campaigns
        ON campaigns.`id` = games.`campaign`
    WHERE
          games.`end`        > CURRENT_TIMESTAMP()
      AND games.`gamemaster` = 25
    ORDER BY
      games.`start`
    ;