Search code examples
mysqlindexingtime-seriesquery-optimization

Improving performance of SQL query for time series data


I'm writing a query in MySQL for a Grafana dashboard, but the performance of only this particular query is very poor. This is the query:

SELECT 
        br.reading
        ,CONCAT(ino.project,"_",SUBSTRING_INDEX(br.sensor,"_",-1)) AS new_metric
        ,ino.vessel
        ,ADDTIME(DATE('2023-01-01'),-TIMEDIFF(ino.date,br.datetime)) AS norm_date        
        ,br.sensor

FROM db.milestones br
 
INNER JOIN db.projectdb ino
        ON br.datetime BETWEEN ino.date AND ino.enddate

WHERE project IN ( 'project1','project2','project3','project4' ) 
AND br.sensor LIKE CONCAT(ino.vessel , '%')

EXPLAIN output:

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows     | filtered | Extra                                          |
| -- | ----------- | ----- | ---------- | ---- | ------------- | --- | ------- | --- | -------- | -------- | ---------------------------------------------- |
| 1  | SIMPLE      | ino   |            | ALL  |               |     |         |     | 5        | 100.00   |                                                |
| 1  | SIMPLE      | br    |            | ALL  | idx_datetime  |     |         |     | 31865381 | 1.23     | Range checked for each record (index map: 0x2) |

Table creation:

CREATE TABLE `projectdb` (
   `project` varchar(10) DEFAULT NULL,
   `date` datetime DEFAULT NULL,
   `enddate` datetime DEFAULT ((`date` + interval 3 day)),
   `vessel` varchar(45) DEFAULT NULL,
   UNIQUE KEY `project_UNIQUE` (`project`),
   KEY `idx_batch` (`project`) USING BTREE
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;


CREATE TABLE `milestones` (
   `id` int NOT NULL AUTO_INCREMENT,
   `datetime` datetime DEFAULT CURRENT_TIMESTAMP,
   `sensor` varchar(20) DEFAULT NULL,
   `reading` decimal(10,4) DEFAULT NULL,
   `monitor` tinyint DEFAULT NULL,
   PRIMARY KEY (`id`),
   KEY `idx_datetime` (`datetime` DESC) USING BTREE,
   KEY `idx_sensors` (`sensor`) USING BTREE
 ) ENGINE=InnoDB AUTO_INCREMENT=31649068 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

indexes:

| Table      | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
| ---------- | ---------- | -------------- | ------------ | ----------- | --------- | ----------- | -------- | ------ | ---- | ---------- | ------- | ------------- | ------- | ---------- |
| projectdb  | 0          | project_UNIQUE | 1            | project     | A         | 4           |          |        | YES  | BTREE      |         |               | YES     |            |
| projectdb  | 1          | idx_project    | 1            | project     | A         | 5           |          |        | YES  | BTREE      |         |               | YES     |            |
| milestones | 0          | PRIMARY        | 1            | id          | A         | 31648118    |          |        |      | BTREE      |         |               | YES     |            |
| milestones | 1          | idx_datetime   | 1            | datetime    | D         | 1723411     |          |        | YES  | BTREE      |         |               | YES     |            |
| milestones | 1          | idx_sensors    | 1            | sensor      | A         | 21924       |          |        | YES  | BTREE      |         |               | YES     |            |

projectdb table sample:

| project   | date             | enddate          | vessel |
| --------- | ---------------- | ---------------- | ------ |
| Project 1 | 24/11/2023 17:30 | 27/11/2023 17:30 | V1     |
| Project 2 | 17/11/2023 19:50 | 20/11/2023 19:50 | V1     |
| Project 3 | 27/10/2023 16:00 | 30/10/2023 16:00 | V2     |

milestones table sample

| id       | datetime         | sensor   | reading |
| -------- | ---------------- | -------- | ------- |
| 22117821 | 10/10/2023 19:20 | V1_FT001 | 100     |
| 22118005 | 10/10/2023 19:21 | V2_FT001 | 120     |
| 22118189 | 10/10/2023 19:23 | V1_FT001 | 100     |
| 22117835 | 10/10/2023 19:20 | V3_FT001 | 105     |

Looking at the execution plan, it's evident that the index on db.milestones.sensor and db.milestones.datetime are not being used. This means that the full table scan needs to go through all 36 million rows, which explains the poor performance. I'm assuming because this is time series data, I have 81 sensors repeating over and over, making the index less likely to be used.

Is there a way to modify this query to make it so the index is always used, or is it a matter of taking an entirely different approach if I want to improve query speeds?

I tried using a subquery in the first join to include the WHERE clause at the end to filter out unnecessary data before joining. This approach ended up being the worst performing taking up to 2 minutes.

I tried to create a view of the db.milestones table to contain the substring which the WHERE clause is being compared against, so I could use = instead of LIKE. This also did not go as expected, the performance was very bad as well.

I tried to use FORCE INDEX to use the index but this didn't work.

Interestingly, if I reduce the conditions in WHERE project IN... , the index on db.milestone.sensoris used and the processing time is much better.


Solution

  • Currently, you have two indices on the project column, one of which is unique, and there is no PK. You should drop the two existing indices and add PK on project.

    The key to reasonable performance is the composite key on (sensor, datetime) and a concrete relationship between vessel and sensor.

    Adding a simple join table (vessel_sensor) would make the world of difference:

    vessel sensor
    V1 V1_FT001
    V2 V2_FT001
    V3 V3_FT001

    And the composite index:

    ALTER TABLE milestones ADD INDEX idx_sensor_datetime (sensor, datetime);
    

    Then add the join table to your existing query:

    SELECT 
        m.reading,
        CONCAT(p.project, '_', SUBSTRING_INDEX(m.sensor, '_', -1)) AS new_metric,
        p.vessel,
        ADDTIME(DATE('2023-01-01'), -TIMEDIFF(p.date, m.datetime)) AS norm_date,
        m.sensor
    FROM projects p
    JOIN vessel_sensor vs
        ON vs.vessel = p.vessel
    JOIN milestones m
        ON m.sensor = vs.sensor
        AND m.datetime BETWEEN p.date AND p.enddate
    WHERE p.project IN ( 'project1', 'project2', 'project3', 'project4' );