I am importing structured log data into MySQL. I hesitate to use the term “big data”, but the size is non-trivial: about 50 columns per row, with a requirement to continuously import 1,000 records a second. The requirements also dictate that all of this data go into the same table in the same database. Really, the only option I see for any sense of performance (and sanity) is to partition the table. Because it’s log data, there’s a timestamp column that I can safely assume any report will filter on, which makes it a prime candidate both for indexing and for use in the partition definitions.
Before Murphy’s Law struck, I was having pretty good luck with something that looks like this:
CREATE TABLE `poorly_designed_log_table` (
...
`timestamp` INTEGER UNSIGNED NOT NULL,
...
INDEX (`timestamp`)
)
ENGINE=InnoDB
PARTITION BY RANGE (`timestamp`) (
PARTITION p0001 VALUES LESS THAN (UNIX_TIMESTAMP('2017-01-01')),
PARTITION p0002 VALUES LESS THAN (UNIX_TIMESTAMP('2017-02-01')),
PARTITION p0003 VALUES LESS THAN (UNIX_TIMESTAMP('2017-03-01')),
...
I receive the incoming data in batches of about 100–200 rows at a time, which I push to the database using a transaction and a multi-row INSERT statement. But because of how the data is fed to me, there can be duplicate records between batches. For example, given one batch covering 10:30:00 to 10:35:59 and the next covering 10:35:00 to 10:40:59, the records between 10:35:00 and 10:35:59 appear in both batches. In other words, the timestamp column is indexable but can't serve as a primary key.
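Mechanically, each batch lands as one transaction wrapping a single multi-row INSERT. A minimal sketch against the table above, with the column list abbreviated and hypothetical January 2017 timestamps (a real batch would carry all ~50 columns and 100–200 value tuples):

```sql
START TRANSACTION;
INSERT INTO `poorly_designed_log_table` (`timestamp` /* , ... other columns ... */)
VALUES (1484000000),
       (1484000001),
       (1484000002);
COMMIT;
```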
Luckily, there’s a primary key in the data I can use. My intent is to perform an INSERT IGNORE and let MySQL reject the duplicates on its own. I know I need to alter my definition so the primary key is enforced, and I also know the key needs to be included in the partition definitions. The primary key is VARCHAR(36) and takes the following form: XXXX-<timestamp>-<sequence>. So for example, the primary keys for three unique records with the timestamp 1499942130 would be XXXX-1499942130-1, XXXX-1499942130-2, and XXXX-1499942130-3.
My question is how do I define partitioning using the timestamp and the primary key so that the data is stored in physical “monthly” tables for fast retrieval?
I got this to work:
CREATE TABLE `poorly_designed_log_table` (
`timestamp` int(10) unsigned NOT NULL,
`pk` varchar(36) NOT NULL,
PRIMARY KEY (`timestamp`,`pk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE COLUMNS(`timestamp`, `pk`)
(PARTITION p0001 VALUES LESS THAN (1483257600, MAXVALUE),
PARTITION p0002 VALUES LESS THAN (1485936000, MAXVALUE),
PARTITION p0003 VALUES LESS THAN (1488355200, MAXVALUE)
);
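With that definition in place, INSERT IGNORE behaves as hoped: re-sending the overlapping tail of a batch is harmless, because the duplicate (`timestamp`, `pk`) pairs are rejected silently. A sketch with hypothetical keys, using a timestamp that falls inside the January partition:

```sql
-- Tail of the first batch:
INSERT IGNORE INTO `poorly_designed_log_table` (`timestamp`, `pk`)
VALUES (1484000000, 'XXXX-1484000000-1'),
       (1484000000, 'XXXX-1484000000-2');

-- Head of the second batch overlaps the first; the duplicate row
-- is skipped and only the new row is inserted:
INSERT IGNORE INTO `poorly_designed_log_table` (`timestamp`, `pk`)
VALUES (1484000000, 'XXXX-1484000000-2'),  -- duplicate: ignored
       (1484000000, 'XXXX-1484000000-3');  -- new: inserted
```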
I had to add the timestamp to the PRIMARY KEY constraint because otherwise you get this error:
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
The MySQL Manual on Partitioning Keys, Primary Keys, and Unique Keys says:
All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.
So it seems redundant to make the primary key span both columns, since `pk` alone is already unique, but it's necessary: every column in the partitioning expression has to appear in every unique key the table defines.
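One operational note: with explicit upper bounds like these, each new month needs its partition added before that month's data arrives, otherwise inserts fail with "Table has no partition for value". A sketch of adding April, assuming the same timezone convention as the bounds above (1491033600 is one month past p0003's bound):

```sql
ALTER TABLE `poorly_designed_log_table`
  ADD PARTITION (
    PARTITION p0004 VALUES LESS THAN (1491033600, MAXVALUE)
  );
```

The flip side is cheap retention: expiring an old month is `ALTER TABLE ... DROP PARTITION p0001`, which removes the partition's data without row-by-row deletes.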