mysql, query-optimization, reporting, timezone-offset

MySQL report query optimization and timezone issues


I'm faced with a MySQL database containing an event table of ~70 million rows, which has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.

The database structure is as follows:

CREATE TABLE `client` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3

CREATE TABLE `class` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `client_id` int DEFAULT NULL,
  `duration` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_client_id_idx` (`client_id`),
  CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3


CREATE TABLE `event` (
  `id` int NOT NULL AUTO_INCREMENT,
  `start_time` datetime DEFAULT NULL,
  `class_id` int DEFAULT NULL,
  `venue_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `fk_class_id_idx` (`class_id`),
  KEY `fk_venue_id_idx` (`venue_id`),
  KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
  CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
  CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3

CREATE TABLE `venue` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3

The following query, which runs fine against an event table with a few thousand rows, demonstrates the desired outcome:

SELECT 
    CAST(event.start_time as date) as day,
    class.name,
    client.name,
    venue.name,
    COUNT(class.name) AS occurrences,
    SUM(class.duration) AS duration
FROM
    class,
    client,
    event,
    venue
WHERE
    event.venue_id = venue.id
    AND event.class_id = class.id
    AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name

The database wasn't indexed for this originally, and although I've tried adding indexes such as alter table event add index idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort the queries once they pass the 10-minute mark, so I don't know for sure how long they'd take to complete).

I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (IODKU). However, I'm then also up against creating the per-day rows in the summary table based on what the database considers a day (UTC), which may not match the application's "day" due to the timezone offset. Short of converting the start_time column to a timestamp type (which would then be inconsistent with all the other date types in the database), is there any way round this, or any other optimization I could make to the original event table to get a more responsive query? TIA
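
For reference, the incremental (IODKU) maintenance I had in mind is roughly the sketch below; the summary table, its columns and the @-variables are hypothetical stand-ins for the values of the event being added:

-- Hypothetical sketch: bump a per-day summary row as each event is added.
-- day_summary, its columns, and the @-variables are illustrative only.
INSERT INTO day_summary (`day`, class_id, occurrences, duration)
VALUES (DATE(@start_time), @class_id, 1, @class_duration)
ON DUPLICATE KEY UPDATE
    occurrences = occurrences + 1,
    duration    = duration + VALUES(duration);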

Update 23/05

Here's the buffer pool size:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+

I've also made a bit of progress with indexing, modifying the query, and creating a summary table. I tried various orderings of columns to test indexes and found idx_event_venueid_classid_starttime (below) to be the most efficient for the event table:

SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name                            | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event |          0 | PRIMARY                             |            1 | id          | A         |    62142912 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| event |          1 | fk_class_id_idx                     |            1 | class_id    | A         |       51286 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| event |          1 | fk_venue_id_idx                     |            1 | venue_id    | A         |       16275 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| event |          1 | idx_event_venueid_classid_starttime |            1 | venue_id    | A         |       13378 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| event |          1 | idx_event_venueid_classid_starttime |            2 | class_id    | A         |       81331 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| event |          1 | idx_event_venueid_classid_starttime |            3 | start_time  | A         |    63909472 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
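
For reference, that index corresponds to a statement along these lines (reconstructed from the output above):

ALTER TABLE event
    ADD INDEX idx_event_venueid_classid_starttime (venue_id, class_id, start_time);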

Here's my modified version of the query: it uses JOIN syntax and CONVERT_TZ to convert from UTC to the timezone required for reporting, then groups by that date (discarding the time portion):

SELECT
  DATE(CONVERT_TZ(event.start_time,
    'UTC',
    'Europe/London')) AS tz_date,
  class.name,
  client.name,
  venue.name,
  COUNT(class.id) AS occurrences,
  SUM(class.duration) AS duration
FROM
  event
JOIN
  class ON class.id = event.class_id
JOIN
  venue ON venue.id = event.venue_id
JOIN
  client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;

And here's the output of explain for that query:

+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table  | partitions | type   | possible_keys                                                       | key                                 | key_len | ref                     | rows | filtered | Extra                        |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
|  1 | SIMPLE      | venue  | NULL       | index  | PRIMARY,idx_venue_id_name                                           | idx_venue_id_name                   | 772     | NULL                    |   28 |   100.00 | Using index; Using temporary |
|  1 | SIMPLE      | event  | NULL       | ref    | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5       | example.venue.id        | 4777 |   100.00 | Using where; Using index     |
|  1 | SIMPLE      | class  | NULL       | eq_ref | PRIMARY,fk_client_id_idx                                            | PRIMARY                             | 4       | example.event.class_id  |    1 |   100.00 | Using where                  |
|  1 | SIMPLE      | client | NULL       | eq_ref | PRIMARY,idx_client_id_name                                          | PRIMARY                             | 4       | example.class.client_id |    1 |   100.00 | NULL                         |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+

The query now takes ~1m 20s to run, so I figured I could prepend it with an INSERT INTO to populate a summary table with timezone-specific dates, and run that on a nightly basis. Summary table structure:

CREATE TABLE `summary` (
  `tz_date` date NOT NULL,
  `class` varchar(255) NOT NULL,
  `client` varchar(255) NOT NULL,
  `venue` varchar(255) NOT NULL,
  `occurrences` int NOT NULL,
  `duration` int NOT NULL,
  PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
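
Populating it nightly is then the same aggregate query prefixed with an INSERT; roughly the following (whether to truncate and fully rebuild each night, or only re-aggregate recent days, is still an open question):

-- Sketch of a full nightly rebuild
TRUNCATE TABLE summary;
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
  DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
  class.name,
  client.name,
  venue.name,
  COUNT(class.id) AS occurrences,
  SUM(class.duration) AS duration
FROM event
JOIN class ON class.id = event.class_id
JOIN venue ON venue.id = event.venue_id
JOIN client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;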

From the original ~60m+ rows in the event table, the aggregated summary table comes out at ~66k rows. Generating the reports from the summary table then takes a fraction of a second (shown below with the data snipped):

SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
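
For example, a month's report is now just (illustrative date range):

SELECT tz_date, class, client, venue, occurrences, duration
FROM summary
WHERE tz_date >= '2022-04-01' AND tz_date < '2022-05-01'
ORDER BY tz_date, client, venue, class;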

I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?


Solution

  • No further indexes are likely to help. It needs to scan the entire event table, reaching into the other tables to get the names.

    Some things for us to look at:

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    EXPLAIN SELECT ...
    

    How much RAM do you have?

    Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
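
    A quick sanity check is to compute the totals without the extra joins and compare; an illustrative cross-check:

    -- Illustrative cross-check: total duration per class computed directly from
    -- event + class, without the client/venue joins. If the joined report's
    -- grand total matches (allowing for rows dropped by NULL foreign keys),
    -- the aggregates are not being inflated.
    SELECT SUM(c.duration * e.n) AS total_duration
    FROM class c
    JOIN (SELECT class_id, COUNT(*) AS n
          FROM event
          GROUP BY class_id) e ON e.class_id = c.id;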

    Please use the newer JOIN ... ON syntax. (Won't change performance.)

    As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and query for it.

    Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.

    How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
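
    A half-hour bucket can be computed with plain datetime arithmetic, so it does not depend on the session time zone; one illustrative expression:

    -- Illustrative: bucket start_time into 30-minute UTC intervals.
    SELECT
        DATE_ADD(DATE(start_time),
                 INTERVAL FLOOR(TIME_TO_SEC(TIME(start_time)) / 1800) * 30 MINUTE)
            AS utc_half_hour,
        COUNT(*) AS occurrences
    FROM event
    GROUP BY utc_half_hour;
    -- Reports in any local time zone can then roll the buckets up into days, e.g.
    --   GROUP BY DATE(CONVERT_TZ(utc_half_hour, 'UTC', 'Europe/London'))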

    Yes, TIMESTAMP instead of DATETIME may be a workaround.

    Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
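
    For the venue side, for example, that would be something along these lines (an illustrative sketch: the FK must be dropped and re-added because both sides have to keep matching types, and each ALTER rewrites the large table):

    ALTER TABLE event DROP FOREIGN KEY fk_venue_id;
    ALTER TABLE venue MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;
    -- venue_id stays NULLable here because of ON DELETE SET NULL
    ALTER TABLE event MODIFY venue_id SMALLINT UNSIGNED DEFAULT NULL;
    ALTER TABLE event ADD CONSTRAINT fk_venue_id
        FOREIGN KEY (venue_id) REFERENCES venue (id)
        ON DELETE SET NULL ON UPDATE CASCADE;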

    (As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)