Search code examples
mysqlquery-optimizationgreatest-n-per-group

Optimizing SQL Query for max value with various conditions from a single MySQL table


I have the following SQL query

SELECT  *
    FROM  `sensor_data` AS `sd1`
    WHERE  (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
                              AND '2017-05-14 00:00:00')
      AND  (`id` = 
            (
                SELECT  `id`
                    FROM  `sensor_data` AS `sd2`
                    WHERE  sd1.mid = sd2.mid
                      AND  sd1.sid = sd2.sid
                    ORDER BY  `value` DESC, `id` DESC
                    LIMIT  1)
           ) 

Background:

I've checked the validity of the query by changing LIMIT 1 to LIMIT 0, and the query works without any problem. However with LIMIT 1 the query doesn't complete, it just states loading until I shutdown and restart.

Breaking the Query down:

I have broken down the query with the date boundary as follows:

SELECT  *
    FROM  `sensor_data` AS `sd1`
    WHERE  (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
                              AND '2017-05-14 00:00:00') 

This takes about 0.24 seconds to return the query with 8200 rows each having 5 columns.

Question:

I suspect the second half of my Query, is not correct or well optimized. The tables are as follows:

Current Table:

+------+-------+-------+-----+-----------------------+
|  id  |  mid  |  sid  |  v  |       timestamp       |
+------+-------+-------+-----+-----------------------+
|  51  |  10   |   1   |  40 |  2015-05-13 11:56:01  |
|  52  |  10   |   2   |  39 |  2015-05-13 11:56:25  |
|  53  |  10   |   2   |  40 |  2015-05-13 11:56:42  |
|  54  |  10   |   2   |  40 |  2015-05-13 11:56:45  |
|  55  |  10   |   2   |  40 |  2015-05-13 11:57:01  |
|  56  |  11   |   1   |  50 |  2015-05-13 11:57:52  |
|  57  |  11   |   2   |  18 |  2015-05-13 11:58:41  |
|  58  |  11   |   2   |  19 |  2015-05-13 11:58:59  |
|  59  |  11   |   3   |  58 |  2015-05-13 11:59:01  |
|  60  |  11   |   3   |  65 |  2015-05-13 11:59:29  |
+------+-------+-------+-----+-----------------------+

Q: How would I get the MAX(v)for each sid for each mid? NB#1: In the example above ROW 53, 54, 55 have all the same value (40), but I would like to retrieve the row with the most recent timestamp, which is ROW 55.

Expected Output:

+------+-------+-------+-----+-----------------------+
|  id  |  mid  |  sid  |  v  |       timestamp       |
+------+-------+-------+-----+-----------------------+
|  51  |  10   |   1   |  40 |  2015-05-13 11:56:01  |
|  55  |  10   |   2   |  40 |  2015-05-13 11:57:01  |
|  56  |  11   |   1   |  50 |  2015-05-13 11:57:52  |
|  58  |  11   |   2   |  19 |  2015-05-13 11:58:59  |
|  60  |  11   |   3   |  65 |  2015-05-13 11:59:29  |
+------+-------+-------+-----+-----------------------+

Structure of the table:

enter image description here

NB#2: Since this table has over 110 million entries, it is critical to have have date boundaries, which limits to ~8000 entries over a 24 hour period.


Solution

  • The query can be written as follows:

    SELECT t1.id, t1.mid, t1.sid, t1.v, t1.ts
    FROM yourtable t1
    INNER JOIN (
            SELECT mid, sid, MAX(v) as v 
                FROM yourtable
            WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
            GROUP BY mid, sid
        ) t2
        ON  t1.mid = t2.mid
        AND t1.sid = t2.sid
        AND t1.v   = t2.v
    INNER JOIN (
            SELECT mid, sid, v, MAX(ts) as ts
                FROM yourtable
            WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
            GROUP BY mid, sid, v
        ) t3
        ON  t1.mid = t3.mid
        AND t1.sid = t3.sid
        AND t1.v   = t3.v
        AND t1.ts  = t3.ts; 
    

    Edit and Explanation:

    The first sub-query (first INNER JOIN) fetches MAX(v) per (mid, sid) combination. The second sub-query is to identify MAX(ts) for every (mid, sid, v). At this point, the two queries do not influence each others' results. It is also important to note that ts date range selection is done in the two sub-queries independently such that the final query has fewer rows to examine and no additional WHERE filters to apply.

    Effectively, this translates into getting MAX(v) per (mid, sid) combination initially (first sub-query); and if there is more than one record with the same value MAX(v) for a given (mid, sid) combo, then the excess records get eliminated by the selection of MAX(ts) for every (mid, sid, v) combination obtained by the second sub-query. We then simply associate the output of the two queries by the two INNER JOIN conditions to get to the id of the desired records.

    Demo