mysql · database · database-migration · partitioning · downtime

Best way to optimize MySQL operations after a table grows


Background info: we run an e-commerce website and are trying to figure out the best way to deal with "historical" data for tables that are used very frequently and are bound to contain a lot of records (i.e. orders, customers, etc.).

I am particularly looking at 2 specific scenarios:

  • DB migrations
  • SELECTs

DB migrations

In the case of DB migrations, we are starting to see that we sometimes need to run an ALTER TABLE that locks the whole table and, when the table has that many records, this can take a while. Of course, all operations on the table are on hold until the migration is done, which means that our checkout might be down just because we are changing a VARCHAR(15) to a VARCHAR(256).

Since MySQL 5.6, a lot of operations can be done "INPLACE", which means (from what I understand) that they won't take a full-table lock: that's ok-ish but still not perfect -- what if we need to change the type of a column (which cannot be executed INPLACE) and we really do not want to be in maintenance mode for minutes?
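For reference, this is how I understand the explicit request for an in-place change (table and column names are placeholders, not our real schema); if the operation doesn't support INPLACE, MySQL refuses instead of silently falling back to a full table copy:

    -- Adding a nullable column can be done in place on 5.6,
    -- so this should succeed without blocking reads or writes
    ALTER TABLE orders
        ALGORITHM=INPLACE, LOCK=NONE,
        ADD COLUMN shipping_notes VARCHAR(256) NULL;

    -- Enlarging a VARCHAR (or changing a column type) is not an
    -- in-place operation on 5.6, so this one is rejected with an error
    ALTER TABLE orders
        ALGORITHM=INPLACE, LOCK=NONE,
        MODIFY coupon_code VARCHAR(256);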

My super-ghetto idea was to simply replicate the table (copy it over), then execute the migration on the copied table, stop writing to the original table (i.e. lock it), copy the data that wasn't synced yet to the copy, and swap them. I think the Percona tool for zero-downtime migrations does something similar, so maybe that's the "best" approach?
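Spelled out in SQL, the manual version would be roughly this (table and column names are made up; it also glosses over the rows written between the backfill and the swap, which is exactly the part the Percona tool handles with triggers):

    -- 1. Create a copy of the table with the new definition
    CREATE TABLE orders_new LIKE orders;
    ALTER TABLE orders_new MODIFY coupon_code VARCHAR(256);

    -- 2. Backfill existing rows; slow, but the live table keeps serving traffic
    INSERT INTO orders_new SELECT * FROM orders;

    -- 3. Briefly block writes, copy whatever was written since step 2,
    --    then swap both names in one atomic statement
    RENAME TABLE orders TO orders_old, orders_new TO orders;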

Opinions?

SELECTs

For SELECTs, since most of the old data is accessed very rarely, I thought of range-partitioning it by date (say pre-2015 / post-2015, for example) and then changing most of our queries to fetch data with WHERE YEAR(created_at) >= 2015.

If the user wants their full historical data, we'd dynamically remove that condition. This should keep the frequently accessed data in its own partition.
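To make that concrete, the scheme I have in mind is roughly this (the column name and the 2015 boundary are just examples; as far as I understand, created_at would also have to be part of every unique key, including the primary key, for MySQL to accept it):

    ALTER TABLE orders
        PARTITION BY RANGE (YEAR(created_at)) (
            PARTITION p_pre2015 VALUES LESS THAN (2015),
            PARTITION p_recent  VALUES LESS THAN MAXVALUE
        );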

Any other ideas? Do you think partitioning might be worth it?


Solution

    • Not until 5.7.1 can you quickly do this:

    VARCHAR size may be increased using an in-place ALTER TABLE, as in this example:

    ALTER TABLE t1 ALGORITHM=INPLACE, CHANGE COLUMN c1 c1 VARCHAR(255);

    • See pt-online-schema-change.

    • If you already have Replication set up, then you can play the game of ALTERing the Slave, then failing over. (Yes, the Percona tools are handy in this area.)

    • Don't 'hide' columns inside functions; the optimizer can't see them:

      WHERE YEAR(created_at) >= 2015  -->  WHERE created_at >= '2015-01-01'

    • Partitioning into only 2 partitions is unlikely to provide any performance benefit.

    • It is reasonable (and commonly done) to PARTITION BY RANGE on some date expression (eg, TO_DAYS()) for the purpose of eventually purging old data via DROP PARTITION. The DROP is so much faster and less invasive than a big DELETE that that feature alone justifies partitioning; see the sketch below. The pruning that you mentioned rarely speeds up queries (except where indexes are poor). More discussion of sliding time series.
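    A minimal sketch of that purge pattern, assuming monthly partitions on a DATETIME column called created_at (the table name and boundaries are illustrative, and created_at must be part of every unique key on the table):

      ALTER TABLE orders
          PARTITION BY RANGE (TO_DAYS(created_at)) (
              PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
              PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01')),
              PARTITION p_future VALUES LESS THAN MAXVALUE
          );

      -- Purging a month is a near-instant metadata operation
      -- instead of a long, row-by-row DELETE:
      ALTER TABLE orders DROP PARTITION p2015_01;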