Search code examples
mysqldatabase-partitioning

Table partitioning using a timestamp and primary key


I am importing structured log data into MySQL. I hesitate to use the term “big data”, but the size is non-trivial—about 50 columns in the table with the requirement to continuously import 1,000 records a second. There are also requirements that all of this data gets imported to the same table in the same database. Really, the only option I see for any sense of performance (and sanity) is to partition the table structure. Because it’s log data, there’s a timestamp column that I can safely assume a report will include in its query and is a prime candidate for indexing and for use in the partitions’ definition.

Before Murphy’s Law struck, I was having pretty good luck with something that looks like this:

CREATE TABLE `poorly_designed_log_table` (
  ...
  `timestamp` INTEGER UNSIGNED NOT NULL,
  ...
  INDEX (`timestamp`)
)
ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (`timestamp`) (
  PARTITION p0001 VALUES LESS THAN (UNIX_TIMESTAMP('2017-01-01')),
  PARTITION p0002 VALUES LESS THAN (UNIX_TIMESTAMP('2017-02-01')),
  PARTITION p0003 VALUES LESS THAN (UNIX_TIMESTAMP('2017-03-01')),
  ...

I receive the incoming data in batches about 100-200 rows at a time which I push to the database using a transaction and multi-row INSERT statement. But, because of how the data is fed to me, there can be duplicate records between batches. Just as an example, given one batch from 10:30:00 to 10:35:59 and the next batch from 10:35:00 to 10:40:59, records between 10:35:00 and 10:35:59 appear in both batches. In other words, the timestamp column is indexible but isn’t a primary key.

Luckily, there’s a primary key in the data I can use. My intent is to perform an INSERT IGNORE and let MySQL reject the duplicates on its own. I know I need to alter my definition so the primary key is enforced, and I also know the key needs to be included in the partition definitions. The primary key is VARCHAR(36) and takes the following form: XXXX-<timestamp>-<sequence>. So for example, the primary keys for three unique records with the timestamp 1499942130 would be XXXX-1499942130-1, XXXX-1499942130-2, and XXXX-1499942130-3.

My question is how do I define partitioning using the timestamp and the primary key so that the data is stored in physical “monthly” tables for fast retrieval?


Solution

  • I got this to work:

    CREATE TABLE `poorly_designed_log_table` (
      `timestamp` int(10) unsigned NOT NULL,
      `pk` varchar(36) NOT NULL,
      PRIMARY KEY (`timestamp`,`pk`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    PARTITION BY RANGE COLUMNS(`timestamp`, pk)
    (PARTITION p0001 VALUES LESS THAN (1483257600, MAXVALUE),
     PARTITION p0002 VALUES LESS THAN (1485936000, MAXVALUE),
     PARTITION p0003 VALUES LESS THAN (1488355200, MAXVALUE)
    );
    

    I had to add the timestamp to the PRIMARY KEY constraint because otherwise you get this error:

    ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function

    The MySQL Manual on Partitioning Keys, Primary Keys, and Unique Keys says:

    All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.

    So it seems to be redundant to make the PK have both columns, but it's necessary.