Search code examples
mysqlsubqueryquery-optimizationquery-performancegroupwise-maximum

MySQL: Subquery optimization problem where subquery checks over 14000 rows


I need help to optimize the subquery below. In brief, I have the following query where tree table joins branch table on s_id AND max timestamp of branch table based on the subquery condition.

I am happy with the result this query returns. However, this query is very slow. The bottleneck is the dependent subquery(branch2) which checks over 14000 rows. How can I optimize the subquery in order to speed up this query?

SELECT *                
FROM dept.tree tree              
    LEFT JOIN dept.branch branch ON tree.s_id = branch.s_id                    
        AND branch.timestamp =
            (
                SELECT MAX(timestamp)                                   
                FROM dept.branch branch2 
                WHERE branch2.s_id = tree.s_id 
                AND branch2.timestamp <= tree.timestamp
            )                
WHERE tree.timestamp BETWEEN CONVERT_TZ('2020-05-16 00:00:00', 'America/Toronto', 'UTC') 
AND CONVERT_TZ('2020-05-16 23:59:59', 'America/Toronto', 'UTC') 
AND tree.s_id IN ('459','460')                
ORDER BY tree.timestamp ASC;

Table tree:

   id     box_id        timestamp       
373001645   1       2020-05-07 06:00:20 
373001695   1       2020-05-07 06:02:26
373001762   1       2020-05-07 06:05:17
373001794   1       2020-05-07 06:06:38
373001810   2       2020-05-07 06:07:21

Table branch:

   id     box_id        timestamp                  data
373001345   1       2020-05-07 06:00:20     {"R": 0.114, "H": 20.808}
373001395   1       2020-05-07 06:02:26     {"R": 0.12, "H": 15.544}
373001462   1       2020-05-07 06:03:01     {"R": 0.006, "H": 55.469}
373001494   1       2020-05-07 06:04:38     {"R": 0.004, "H": 51.85}
373001496   1       2020-05-07 06:05:18     {"R": 0.02, "H": 5.8965}
373001497   1       2020-05-07 06:06:39     {"R": 0.12, "H": 54.32}
373001510   2       2020-05-07 06:07:09     {"R": 0.34, "H": 1.32}
373001511   2       2020-05-07 06:07:29     {"R": 0.56, "H": 32.7}

branch has s_id and timestamp indexed

I am using 5.7.25-google-log version

EXPLAIN gives the following:

    id    select_type        table     partitions    type    possible_keys    key    key_len    ref    rows    filtered     Extra
    1   PRIMARY tree        range   unique_timestamp_s_id,idx_s_id_timestamp,idx_timestamp  idx_s_id_timestamp  10      2629    100.00  Using index condition; Using filesort
    1   PRIMARY branch      ref unique_timestamp_s_id,idx_timestamp unique_timestamp_s_id   5   func    1   100.00  Using where
    2   DEPENDENT SUBQUERY  branch2     ref unique_timestamp_s_id,idx_s_id_timestamp,idx_timestamp  idx_s_id_timestamp  5   tree.s_id   14122   33.33   Using where; Using index

Solution

  • This shall be faster:

    select
        tree.s_id, tree.timestamp, branch.data
    from
        (
            SELECT
                tree.s_id, tree.timestamp, max(branch.timestamp) as max_branch_timestamp
            FROM
                dept.tree tree
                    LEFT JOIN dept.branch branch
                    ON(
                            branch.s_id      =  tree.s_id
                        and branch.timestamp <= tree.timestamp
                    )
            WHERE
                tree.timestamp BETWEEN
                    CONVERT_TZ('2020-05-16 00:00:00', 'America/Toronto', 'UTC') AND
                    CONVERT_TZ('2020-05-16 23:59:59', 'America/Toronto', 'UTC')
            AND tree.s_id IN ('459','460')                
            group by tree.s_id, tree.timestamp
        ) tree
            left outer join branch
            on(
                    branch.s_id      = tree.s_id
                and branch.timestamp = tree.max_branch_timestamp
            )