Search code examples
mysqlsqlclustered-indexnon-clustered-index

Will DATE in primary key gain performance in range scans?


We have a large (and growing) MySQL db of bird records. At present we have three main tables, which simplified looks as follows:

RECORDS (id, birdid, tripid, gender, age) PRIMARY KEY id
BIRDTRIPS (id, locid, date, starttime, endtime, notes) PRIMARY KEY id
LOCATIONS (id, description, centerlat, centerlng) PRIMARY KEY id

All ids are INTs and date is DATE type.

In order to gain performance i´ve decided to do some denormalization by replicating date and locid into RECORDS:

RECORDS (id, birdid, tripid, gender, age, locid, date) PRIMARY KEY id

This way, many queries will avoid expensive joins to BIRDTRIPS and LOCATIONS.

MySQL has only one clustered index per table, and this is always the primary key. I am tempted
to try PRIMARY KEY (date, id) on RECORDS to exploit the clustered index for faster range scans on the date column and partitioning of the table. The id is only included in the key because there might be many records on the same date. Theoretically the primary key is often kind of "wasted" on the id, where a non-clustered UNIQUE index would be sufficient for lookups.

And now my question:

Do any of you have practical experience with this approach? Are there any drawbacks i might have overlooked?


Solution

  • I will answer my own question in case anyone stumbles upon it. I found this very nice blogpost where performance has been tested for this approach: Clustered indexing and query performance

    The conclusions of the experiments carried out in the blogpost are:

    PROS:

    1. Using the compound primary key, in my case that would be PRIMARY KEY (date, id), speeds up date range queries about nine times.

    2. The performance on date range queries will remain close to constant as the table grows, whereas it will gradually degrade if the primary key is only on 'id'.

    CONS:

    1. Simple look-ups on the 'id' column (e.g. joins on the table) will be almost three times slower with the compound index.

    2. The larger clustered index results in a noticeably larger disk utilization. This is because all the other indexes point to the clustered index, which means the large clustered index increases the size of all the other indexes. This probably would lead to less cache hits for queries as less of the table could fit into memory.

    So if the major part of the queries against the table are business critical date range queries and your system can "afford" the penalty on look-ups, this approach might help you.

    My own conclusion is that I do not dare to do this in production. Instead I will do table partitioning on YEAR of the date column to prevent full table scans on date range queries.

    Hope this will help someone facing similar challenges.