
InnoDB clustered index performance when using random values as primary key


By default, the primary keys of my InnoDB tables are auto-increment integers. In an effort to hide the number of rows in the database, the application code generates a random value for the primary key instead.

This is an example of a typical schema:

CREATE TABLE `MUSIC_LINK` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `project_id` int(11) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=latin1;

I am reading a book and have just found out that InnoDB uses a clustered index sorted by the primary key. In essence, this means that the records in the database file are indexed and ordered by primary key value. This is great as long as the primary key of the next record is always bigger than that of the last record inserted (which auto-increment guarantees by default).

What happens when the primary key is no longer auto-incremented? In order to keep the file sorted by primary key, there must be lots of rewrites every time a primary key smaller than the biggest primary key is inserted.

Am I misunderstanding how clustered indexes work in InnoDB? Because this sounds like a giant performance issue.


Solution

  • InnoDB:

    With an AUTO_INCREMENT PRIMARY KEY, the "next" row will be put at the "end" of the BTree that holds the data for the table. This is efficient, and the "last" block will be updated a lot.

    Note: Blocks are kept in the buffer_pool, to be eventually written to disk.

    With a "random" PK, such as a GUID, UUID, MD5, SHA1, etc, the "next" row to be inserted needs to go into some 'random' place in the BTree that holds the data. If the buffer_pool is big enough, the necessary block will still be sitting in it, so the efficiency is not much different from auto-increment (AI).

    On the other hand, if the data is too big to fit in the buffer_pool (or other activity keeps bumping the blocks out), then an insert will need to fetch the block before modifying it.

    If, for example, the table is 20 times as big as what can be held in the buffer_pool, then a random write has only a 1-in-20 chance of finding its block cached. That is, 95% of the time an INSERT has to wait for a disk read.
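To make that 1-in-20 arithmetic concrete, here is a back-of-envelope sketch in Python (not MySQL code; the function name and the 20x ratio are just the example figures from above):

```python
# If the table is N times larger than the buffer_pool, a random insert
# finds its block already cached with probability roughly 1/N.
def miss_rate(table_to_pool_ratio: float) -> float:
    """Fraction of random-PK inserts that must wait for a disk read."""
    return 1.0 - 1.0 / table_to_pool_ratio

# The example above: the table is 20x the size of the buffer_pool.
print(f"{miss_rate(20):.0%}")  # prints "95%"
```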

    But... You prompted a discussion of INSERTs. What about SELECTs? What, if any, pattern is there to the selects? If it is 'random' anyway, then the type of PK does not matter. If, on the other hand, the selects tend to reach for "recent" items (eg, news articles), then AI wins for large tables because of the increased likelihood of the desired block being cached.

    Cluster

    A Comment implies some confusion over "cluster/ed/ing". Some definitions (in a MySQL/MariaDB context):

    • A group of servers with identical data, working together: NDB Cluster vs Galera Cluster vs Clustrix (a 3rd-party offering).
    • A "clustered index" is when the data is attached to the index. In InnoDB, the PK is always clustered with the data. (Note: MyISAM and other engines do not necessarily do this.)
    • When records to be fetched are next to each other in the layout on disk (think of the PK or a secondary index), those rows are "clustered together". This is worth noting because fetching one block retrieves several rows that you need.
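The payoff of rows being clustered together can be sketched with a toy model (Python rather than SQL; the 10-rows-per-block figure is made up for illustration, a real 16KB InnoDB page holds far more):

```python
import random

ROWS_PER_BLOCK = 10  # hypothetical; real InnoDB pages are 16KB

def blocks_touched(ids):
    """Set of 'disk blocks' needed to fetch these rows, assuming the
    table is laid out in PK order (i.e., a clustered index)."""
    return {pk // ROWS_PER_BLOCK for pk in ids}

# Ten adjacent PK values: they all live in the same block.
adjacent = blocks_touched(range(0, 10))

# Ten scattered PK values: nearly every row costs its own block read.
random.seed(1)  # seeded so the example is repeatable
scattered = blocks_touched(random.sample(range(1000), 10))

print(len(adjacent), len(scattered))  # 1 block vs. roughly 10 blocks
```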

    So, back to the Comment:

    • Jumping around in the PRIMARY KEY (due to using what I called a random PK, or due to simply not fetching rows in some relevant order) is stuck with jumping around in the table.
    • A UUID has a "sorted order", but it is not useful for much of anything.
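That last point is easy to demonstrate. A quick Python illustration (using a seeded PRNG in place of `uuid.uuid4()` so the run is repeatable; this is a sketch, not how MySQL generates anything):

```python
import random
import uuid

random.seed(42)  # deterministic stand-in for the usual os.urandom source
ids = [uuid.UUID(bytes=random.randbytes(16), version=4) for _ in range(8)]

# UUIDs do compare and sort like 128-bit integers, so a "sorted order"
# exists. But it has nothing to do with the order of generation, so
# "recently inserted" rows end up scattered across the whole PK range.
print(ids == sorted(ids))
```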