mysql, mysql-8.0

What is the difference between a clustered primary index and partitioning for organising data in a MySQL table?


I have a table, table1, whose primary key is PRIMARY KEY (year, month, id).

From what I understand, with this primary key the B-tree stores rows next to each other, ordered by year, month, id:

(2021, 12, 1)
(2022, 12, 1)
(2022, 12, 2)
(2023, 1, 1)

CREATE TABLE `table1` (
  `id` int AUTO_INCREMENT NOT NULL,
  `entity_id` varchar(36) NOT NULL,
  `entity_type` varchar(36) NOT NULL,
  `score` decimal(4,3) NOT NULL,
  `raw` json DEFAULT NULL,
  `month` int NOT NULL,
  `year` int NOT NULL,
  `date` DATE NOT NULL,
  `created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `deleted_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`year`, `month`, `id`),
  KEY (`id`),
  KEY `table1_indx` (`year`, `month`,`score`,`entity_type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

If I then run a query that filters on month and year, it should be efficient because the matching rows are clustered together.

EXPLAIN
SELECT
  table1.entity_id AS entity_id,
  table1.entity_type,
  table1.score
FROM table1
WHERE table1.month = 12
  AND table1.year = 2022
  AND table1.score > 0
  AND table1.entity_type IN ('type1', 'type2', 'type3', 'type4');

If my assumption above is correct, what changes if the table is instead partitioned by year and subpartitioned by month?

CREATE TABLE `table1` (
  `id` int AUTO_INCREMENT NOT NULL,
  `entity_id` varchar(36) NOT NULL,
  `entity_type` varchar(36) NOT NULL,
  `score` decimal(4,3) NOT NULL,
  `raw` json DEFAULT NULL,
  `month` int NOT NULL,
  `year` int NOT NULL,
  `date` DATE NOT NULL,
  `created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `deleted_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`year`, `month`, `id`),
  KEY (`id`),
  KEY `table1_indx` (`year`, `month`,`score`,`entity_type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (`year`)
SUBPARTITION BY HASH (`month`)
(PARTITION p2021 VALUES LESS THAN (2022)
 (SUBPARTITION dec_2021 ENGINE = InnoDB,
  SUBPARTITION jan_2021 ENGINE = InnoDB,
  SUBPARTITION feb_2021 ENGINE = InnoDB,
  SUBPARTITION mar_2021 ENGINE = InnoDB,
  SUBPARTITION apr_2021 ENGINE = InnoDB,
  SUBPARTITION may_2021 ENGINE = InnoDB,
  SUBPARTITION jun_2021 ENGINE = InnoDB,
  SUBPARTITION jul_2021 ENGINE = InnoDB,
  SUBPARTITION aug_2021 ENGINE = InnoDB,
  SUBPARTITION sep_2021 ENGINE = InnoDB,
  SUBPARTITION oct_2021 ENGINE = InnoDB,
  SUBPARTITION nov_2021 ENGINE = InnoDB),
 PARTITION p2022 VALUES LESS THAN (2023)
 (SUBPARTITION dec_2022 ENGINE = InnoDB,
  SUBPARTITION jan_2022 ENGINE = InnoDB,
  SUBPARTITION feb_2022 ENGINE = InnoDB,
  SUBPARTITION mar_2022 ENGINE = InnoDB,
  SUBPARTITION apr_2022 ENGINE = InnoDB,
  SUBPARTITION may_2022 ENGINE = InnoDB,
  SUBPARTITION jun_2022 ENGINE = InnoDB,
  SUBPARTITION jul_2022 ENGINE = InnoDB,
  SUBPARTITION aug_2022 ENGINE = InnoDB,
  SUBPARTITION sep_2022 ENGINE = InnoDB,
  SUBPARTITION oct_2022 ENGINE = InnoDB,
  SUBPARTITION nov_2022 ENGINE = InnoDB),
 PARTITION p2023 VALUES LESS THAN (2024)
 (SUBPARTITION dec_2023 ENGINE = InnoDB,
  SUBPARTITION jan_2023 ENGINE = InnoDB,
  SUBPARTITION feb_2023 ENGINE = InnoDB,
  SUBPARTITION mar_2023 ENGINE = InnoDB,
  SUBPARTITION apr_2023 ENGINE = InnoDB,
  SUBPARTITION may_2023 ENGINE = InnoDB,
  SUBPARTITION jun_2023 ENGINE = InnoDB,
  SUBPARTITION jul_2023 ENGINE = InnoDB,
  SUBPARTITION aug_2023 ENGINE = InnoDB,
  SUBPARTITION sep_2023 ENGINE = InnoDB,
  SUBPARTITION oct_2023 ENGINE = InnoDB,
  SUBPARTITION nov_2023 ENGINE = InnoDB),
 PARTITION pmax VALUES LESS THAN MAXVALUE
 (SUBPARTITION dec_max ENGINE = InnoDB,
  SUBPARTITION jan_max ENGINE = InnoDB,
  SUBPARTITION feb_max ENGINE = InnoDB,
  SUBPARTITION mar_max ENGINE = InnoDB,
  SUBPARTITION apr_max ENGINE = InnoDB,
  SUBPARTITION may_max ENGINE = InnoDB,
  SUBPARTITION jun_max ENGINE = InnoDB,
  SUBPARTITION jul_max ENGINE = InnoDB,
  SUBPARTITION aug_max ENGINE = InnoDB,
  SUBPARTITION sep_max ENGINE = InnoDB,
  SUBPARTITION oct_max ENGINE = InnoDB,
  SUBPARTITION nov_max ENGINE = InnoDB));

Ultimately I'm trying to understand whether having both PRIMARY KEY (year, month, id) and partitioning is "worthwhile".


Solution

  • Think of partitions as if they are individual tables. In fact, if you use innodb_file_per_table, each partition is stored in its own tablespace file under the data directory.

    Each partition has its own B-tree for the clustered index, just like any table.

    Even if your query does partition pruning so it only scans a single partition, that partition might be large enough that a linear scan is too time-consuming.

    After partition pruning has reduced the set of partitions to scan, the optimizer may use an index within each remaining partition. It makes that choice exactly as it would for a non-partitioned table. Using an index reduces the rows examined within each of those partitions.

    So it's definitely an advantage to have indexes that suit your queries, whether you use partitioning or not.

    This brings up the question of whether partitioning is even worth it, if you have indexes that optimize the query. In fact, in many cases, using indexes alone, without partitioning, is plenty of optimization. The index helps to narrow the search only to matching rows, just like partitioning does.
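One way to see both effects on the question's own table is to look at EXPLAIN. The following is a minimal sketch, assuming the partitioned `table1` from the question exists in the current schema. No output is shown, because it depends on your data and index statistics.

```sql
-- Which partitions will be read, and which index is used inside them?
-- The `partitions` column reflects the result of partition pruning;
-- `key` and `rows` show the chosen index and the estimated rows examined.
EXPLAIN
SELECT entity_id, entity_type, score
FROM table1
WHERE year = 2022
  AND month = 12
  AND score > 0
  AND entity_type IN ('type1', 'type2', 'type3', 'type4');

-- Estimated row counts per partition and subpartition, to judge how
-- expensive a scan of a single (sub)partition without an index would be.
SELECT PARTITION_NAME, SUBPARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'table1'
ORDER BY PARTITION_ORDINAL_POSITION, SUBPARTITION_ORDINAL_POSITION;
```

Running the same EXPLAIN against the non-partitioned version of the table and comparing `key` and `rows` is a reasonable way to check whether partitioning adds anything beyond what the index already does for this query.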