Search code examples
mysqlsqlindexingclustered-index

Why does InnoDB require clustered index upon creating a table?


Even if I don't have a primary key or unique key, InnoDB still creates a cluster index on a synthetic column as described below.

https://dev.mysql.com/doc/refman/5.5/en/innodb-index-types.html

So, why does InnoDB have to require clustered index? Is there a defenite reason clustered index must exist here?

In Oracle Database or MSSQL I don't see they require this. Also, I don't think cluster index have so tremendous advantage comparing to ordinary table either.

It is true that looking for data using clustering key does not need an additional disk read and faster than when I don't have one but without cluster index, secondary index can look up faster by using physical rowID. Therefore, I don't see any reason for insisting using it.


Solution

  • Other vendors have a "ROWNUM" or something like that. InnoDB is much simpler. Instead of having that animal, it simply requires something that you will usually want anyway. In both cases, it is a value that uniquely identifies a row. This is needed for guts of transactions -- knowing which row(s) to lock, etc, to provide transactional integrity. (I won't go into the rationale here.)

    In requiring (or providing) a PK, and in doing certain other simplifications, InnoDB sacrifices several little-used (or easily worked around) features: Multiple pks, multiple clustered indexes, no pk, etc.

    Since the "synthetic column" takes 6 bytes, it is almost always better to simply provide id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, even if you don't use it. But if you don't use it, but do have a non-NULL UNIQUE key, then you may as well make it the PK. (As MySQL does by default.)

    A lookup by a secondary key first gets the PK value from the secondary key's BTree. Then the main BTree (with the data ordered by the PK) is drilled down to find the row. Hence, secondary keys can be slower that use of the PK. (Usually this is not enough slower to matter.) So, this points out one design decision that required a PK.) (Other vendors use ROWNUM, or something, to locate the record, instead of the PK.)

    Back to "Why?". There are many decisions in MySQL where the designers said "simplicity is better for this free product, let's not bother building some complex, but little-used feature. At first there were no subqueries (temp tables were a workaround). No Views (they are only syntactic sugar). No Materialized Views (OK, this may be a failing; but they can be simulated). No bit-mapped or hash or isam (etc) indexing (BTree is very good for "all-around" usage).

    Also, by always "clustering" the PK with the data, lookups via the PK are inherently faster than the competition (no going through a ROWNUM). (Secondary key lookups may not be faster.)

    Another difference -- MySQL was very late in implementing "index merge", wherein it uses two indexes, then ANDs or ORs the results. This can be efficient with ROWNUMs, but not with clustered PKs.

    (I'm not a MySQL/MariaDB/Percona developer, but I have used them since 1999, and have been to virtually all major MySQL Conferences, where inside info is often divulged. So, I think I have enough insight into their thinking to present this answer.)