Tags: mysql, memory-management, innodb, clustered-index

Database memory and disk work assignation


I was reading an ebook chapter about indexes and indexing strategies. Many of these aspects I already know, but I got stuck on clustered indexes in InnoDB. Here is the quote:

Clustering gives the largest improvement for I/O-bound workloads. If the data fits in memory the order in which it’s accessed doesn’t really matter, so clustering doesn’t give much benefit.

I believe this is true, but how am I supposed to know whether the data will fit in memory? How does the database decide when to process the data in memory, and when not?

Let's say we have a table Employee with columns ID, Name, and Phone, filled with 100,000 records.

If, for example, I put the clustered index on the ID column and run this query:

SELECT * FROM Employee;

how do I know whether it will benefit from the clustered index?

It's somewhat related to this thread: Difference between In memory databases and disk memory database,

but I am still not sure how the database will behave.


Solution

  • Your example might be 20MB.

    "In memory" really means "in the InnoDB buffer_pool", whose size is controlled by innodb_buffer_pool_size, which should be set to about 70% of available RAM.

    If your query hits the disk instead of finding everything cached in the buffer_pool, it will run roughly 10 times as slowly (this is just a Rule of Thumb).
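    To see how big the buffer pool is, and how often queries are missing it, you can check the server's own variables and counters (these are standard InnoDB names):

    ```sql
    -- Size of the buffer pool, in bytes
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Logical read requests vs. reads that had to go to disk
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
    ```

    If Innodb_buffer_pool_reads stays near zero while Innodb_buffer_pool_read_requests keeps climbing, your working set is effectively "in memory".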

    What you are saying about the "clustered index" is misleading. Let me turn things around...

    • InnoDB really needs a PRIMARY KEY.
    • A PK is (by definition in MySQL) UNIQUE.
    • There can be only one PK for a table.
    • The PK can be a "natural" key composed of one (or more) columns that 'naturally' work.
    • If you don't have a "natural" choice, then use id INT UNSIGNED NOT NULL AUTO_INCREMENT.
    • The PK and the data are stored in the same BTree. (Actually a B+Tree.) This leads to "the PK is clustered with the data".
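    As a minimal sketch, the Employee table from the question might look like this (the column types are assumptions, not given in the question):

    ```sql
    CREATE TABLE Employee (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
        name  VARCHAR(100),
        phone VARCHAR(20),
        PRIMARY KEY (id)   -- InnoDB clusters the row data in this B+Tree
    ) ENGINE=InnoDB;
    ```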

    The real question is not whether something is clustered, but whether it is cached in RAM. (Remember the 10x RoT.)

    • If the table is small, it will stay in cache (once all its blocks are touched), hence avoid disk hits.
    • If some subset of a huge table is "hot", it will tend to stay in cache.
    • If you must access a huge table "randomly", you will suffer a slowdown due to lots of disk hits. (This happens when using UUIDs as the PRIMARY KEY or in any other INDEX.)
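    To answer "will it fit?", compare the table's footprint against the buffer pool size. A rough on-disk size (data plus indexes) is available from information_schema:

    ```sql
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb
    FROM information_schema.TABLES
    WHERE table_schema = DATABASE()
      AND table_name   = 'Employee';
    ```

    If total_mb is comfortably below innodb_buffer_pool_size (minus room for everything else the server caches), the table can stay fully cached once warmed up.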

    How does the database decide when to process the data in memory, and when not?

    That's 'wrong', too. All processing is in memory. On a block-by-block basis, pieces of the tables and indexes are moved into / out of the buffer_pool. A block (in InnoDB) is 16KB. And the buffer_pool is a "cache" of such blocks.
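    You can confirm the block size on your server; 16KB is the default. At that size, a 20MB table is about 20MB / 16KB ≈ 1,280 blocks:

    ```sql
    -- InnoDB's block (page) size, in bytes
    SELECT @@innodb_page_size;
    ```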

    SELECT * FROM Employee;
    

    is simple, but costly. It operates thus:

    1. "Open" table Employee (if not already open -- a different 'cache' handles this).
    2. Go to the start of the table. This involves drilling down the left side of the PK's BTree to the first leaf node (block). And fetch it into the buffer_pool if not already cached.
    3. Read a row -- this will be in that leaf node.
    4. Read next row -- this is probably in the same block. If not, get the 'next' block (read from disk if necessary).
    5. Repeat step 4 until finished with the table.
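    You can ask MySQL to confirm this access path with EXPLAIN (the exact output shape varies by version):

    ```sql
    EXPLAIN SELECT * FROM Employee;
    -- type: ALL → a full scan of the clustered (PK) B+Tree, leaf block by leaf block
    ```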

    Things get more interesting if you have a WHERE clause. And then it depends on whether the PK or some other INDEX is involved.

    Etc, etc.