Tags: mysql, query-optimization, partitioning

Why does this simple query not run optimally in MySQL using partitions?


Under MySQL 5.7.31, on an Ubuntu 18.04 system with an 8-core CPU and 24 GB of RAM, I have a table named mytable. It is partitioned into 30 partitions using HASH(YEAR(created_at)) on a datetime column named created_at. It contains a total of 185,378,420 records. It has an auto_increment column named id that is part of the primary key.

Table structure:

CREATE TABLE `mytable` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created_at` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`,`created_at`),
  KEY `created_at` (`created_at`)
) ENGINE=MyISAM AUTO_INCREMENT=194156422 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (YEAR(created_at))
PARTITIONS 30 */

I am trying to run this simple query.

SELECT * FROM `mytable` WHERE `id` IN (194070462,194070461) ORDER BY `id` DESC;

I cannot understand why it gets stuck in the "Sending data" phase; I am not sure it ever completes successfully.

On the other hand, if I change the sort direction, it runs very fast (within milliseconds).

SELECT * FROM `mytable` WHERE `id` IN (194070462,194070461) ORDER BY `id` ASC;

It also runs fast if I remove the ORDER BY clause completely:

SELECT * FROM `mytable` WHERE `id` IN (194070462,194070461);

Both of these queries produce the expected output, which is:

+-----------+---------------------+
| id        | created_at          |
+-----------+---------------------+
| 194070461 | 2021-07-26 21:16:47 |
| 194070462 | 2021-07-26 21:16:47 |
+-----------+---------------------+

I believe it should be able to first gather the rows and then sort them.

The EXPLAIN output is exactly the same for all 3 queries:

+----+-------------+-------+---------------------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
| id | select_type | table | partitions                                                                                                    | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-------+---------------------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | mytable | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29 | range | PRIMARY       | PRIMARY | 4       | NULL |   20 |   100.00 | Using index condition |
+----+-------------+-------+---------------------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+------+----------+-----------------------+

Any idea what is happening behind the scenes? Is it somehow a partitioning limitation, caused by the combination of the range query and the ORDER BY ... DESC clause? Is there any workaround for this?

Here is the EXPLAIN FORMAT=JSON output:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "9.62"
    },
    "ordering_operation": {
      "using_filesort": false,
      "table": {
        "table_name": "mytable",
        "partitions": [
          "p0",
          "p1",
          "p2",
          "p3",
          "p4",
          "p5",
          "p6",
          "p7",
          "p8",
          "p9",
          "p10",
          "p11",
          "p12",
          "p13",
          "p14",
          "p15",
          "p16",
          "p17",
          "p18",
          "p19",
          "p20",
          "p21",
          "p22",
          "p23",
          "p24",
          "p25",
          "p26",
          "p27",
          "p28",
          "p29"
        ],
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "id"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 20,
        "rows_produced_per_join": 20,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "5.62",
          "eval_cost": "4.00",
          "prefix_cost": "9.62",
          "data_read_per_join": "11K"
        },
        "used_columns": [
          "id",
          "created_at"
        ],
        "attached_condition": "(`mydb`.`mytable`.`id` in (194070462,194070461))"
      }
    }
  }
}

Solution

  • PARTITION BY HASH is, in my opinion, totally useless, especially for performance. Ditto for SUBPARTITIONing and any method other than BY RANGE. Even then there are very few cases where it is useful.

    In your situation...

    Is id the PRIMARY KEY (or at least the first column in the PK or some secondary index)? If so, then:

    PARTITIONing by anything other than id will involve looking in every partition. Sure, it will use an index once it is there. But that is 60 lookups to find 2 rows (60 = 2 items in the IN list × 30 partitions). Apparently, it is even worse than this -- since it is "stuck", as you put it.

    Without partitioning, it would be a simple 2 lookups.
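    One way to express "gather first, then sort" from the question is a derived table. This is only a sketch, not verified against 5.7.31 on this table; in MySQL 5.7 a derived table is normally merged back into the outer query, but a LIMIT in the subquery prevents that merge and forces materialization before the outer sort:

    ```sql
    -- Hypothetical workaround: materialize the two matching rows first,
    -- then sort the small result set. The inner LIMIT blocks derived-table
    -- merging in MySQL 5.7, so the DESC sort runs on the materialized rows.
    SELECT *
    FROM (
      SELECT * FROM `mytable`
      WHERE `id` IN (194070462, 194070461)
      LIMIT 2
    ) AS t
    ORDER BY `id` DESC;
    ```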

    PARTITION BY RANGE(YEAR(created_at)) would be better than BY HASH, but not much better. There are very few queries that would make effective use of "partition pruning". And HASH can never use a "range" of dates for pruning.
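    For reference, a BY RANGE layout on the year would look roughly like this. The table name, column list, and year boundaries are illustrative, not a recommendation for this schema:

    ```sql
    -- Sketch only: one partition per year, with a catch-all at the end.
    -- RANGE partitions (unlike HASH) can be pruned by a date range predicate.
    CREATE TABLE `mytable_by_range` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `created_at` datetime NOT NULL,
      PRIMARY KEY (`id`, `created_at`)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(created_at)) (
      PARTITION p2019 VALUES LESS THAN (2020),
      PARTITION p2020 VALUES LESS THAN (2021),
      PARTITION p2021 VALUES LESS THAN (2022),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
    );
    ```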

    Please show me SHOW CREATE TABLE and the main queries that are applied. I'll advise on the optimal set of indexes to use and whether or not partitioning is of any benefit. Another question: will you be purging "old" rows? (Purging via DROP PARTITION is the main use for BY RANGE on a date. But that only helps the DELETE, not SELECTs.)

    Change to InnoDB!
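    The suggested cleanup could be done with two ALTER statements, shown here as a sketch. On a ~185M-row table each rebuild is a long-running, disk-heavy operation, so test on a copy first:

    ```sql
    -- Drop the HASH partitioning, then convert the storage engine.
    -- Each statement rebuilds the entire table.
    ALTER TABLE `mytable` REMOVE PARTITIONING;
    ALTER TABLE `mytable` ENGINE = InnoDB;
    ```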