mysql date amazon-rds partitioning database-partitioning

MySql partitioned table - select between date range using PK very slow compared to date IN (...) clause

I have a very large table consisting of approximately 3 Million records per day.

The following query is so slow

EXPLAIN SELECT *
FROM summary_by_to_days_range 
WHERE(record_date BETWEEN '2019-03-12' AND '2019-03-15')
AND unit_id = 1148210
AND enum_key IN (9, 10, 38, 311)
GROUP BY unit_id, record_date
ORDER BY record_date DESC;

With the following results:

+---------+----------+-------------+---------------+---------+-----------------------------------------------------+
|  rows   | filtered |    Extra    | possible_keys |   key   |                     partitions                      |
+---------+----------+-------------+---------------+---------+-----------------------------------------------------+
| 9072566 |        4 | Using where | PRIMARY       | PRIMARY | from20190312,from20190313,from20190314,from20190315 |
+---------+----------+-------------+---------------+---------+-----------------------------------------------------+

Compared to

EXPLAIN SELECT *
FROM summary_by_to_days_range 
WHERE(record_date IN ('2019-03-12','2019-03-13','2019-03-14','2019-03-15'))
AND unit_id = 1148210
AND enum_key IN (9, 10, 38, 311)
GROUP BY unit_id, record_date
ORDER BY record_date DESC;

With much better results:

+------+----------+-------------+---------------+---------+-----------------------------------------------------+
| rows | filtered |    Extra    | possible_keys |   key   |                     partitions                      |
+------+----------+-------------+---------------+---------+-----------------------------------------------------+
|   16 |      100 | Using where | PRIMARY       | PRIMARY | from20190312,from20190313,from20190314,from20190315 |
+------+----------+-------------+---------------+---------+-----------------------------------------------------+

And I can't understand why.. I'm supplying PK values, the only difference is the Between date clause!

The Table schema

  `CREATE TABLE summary_by_to_days_range (
  `record_date` date NOT NULL,
  `unit_id` int(11) NOT NULL,
  `enum_key` int(11) NOT NULL,
  `str_value` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`record_date`,`unit_id`,`enum_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(record_date))
(PARTITION START_h VALUES LESS THAN (0) ENGINE = InnoDB,
 PARTITION from20181231 VALUES LESS THAN (737425) ENGINE = InnoDB,
 PARTITION from20190101 VALUES LESS THAN (737426) ENGINE = InnoDB,
.
.
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB)`

I also tried partitioning by key, by range column by hash of DAYOFYEAR() all resulting with the same disappointing result..

Anyone?

Solution

Usually it is inefficient to start the PRIMARY KEY with the column(s) of the "partition key". After all, first comes the "partition pruning"; why then filter on the same stuff?

Usually it is inefficient to start any composite index with a column that will be used in a "range" test. It is subtle, but I think this explains the difference you are seeing. With IN (your second query), more fields in the PK can be used, thereby running faster.

And, no, the Optimizer is smart enough to understand how "dates" work. Otherwise, it could do your second query as fast as the first. (This crudely addresses some of the Comments.)

(for reference)

WHERE record_date BETWEEN '2019-03-12' AND '2019-03-15'
  AND unit_id = 1148210
  AND enum_key IN (9, 10, 38, 311)

`record_date` date NOT NULL,
PRIMARY KEY (`record_date`,`unit_id`,`enum_key`)

PARTITION BY RANGE (TO_DAYS(record_date))

Let's address both of my comments as follows.

Change to

PRIMARY KEY(unit_id, enum_key, record_date)

With this PK, either of your SELECTs will

Prune down to the 4 partitions (as before)
Quickly filter down to the desired unit_id. (I suspect this is where most of the efficiency is.)
Jump through the rows for enum_key
Check that record_date is correct.

I'm glad to set the start and future partitions. (Perhaps you have already read this.)

A note: It can be somewhat inefficient to have more than about 50 partitions. If you have (or will have) more than that, consider using weekly or monthly partitions. This will have a slight impact on my PK, but not until step 4.

As for PARTITION BY HASH... I have found no performance gain using it. (Or at least none that could not be achieved some other way.)