Currently I'm testing out TokuDB and I'm very impressed. At the moment inserts have peaked at just over 50,000 per second with two simultaneous jobs running. The average insert rate is between 38,000 and 42,000 inserts per second.
I would like to go even higher, to 100,000 inserts per second, as I need to insert 1.2 billion calculated rows now and about 6 billion more in the near future. I would like some advice on how to achieve this :-)
My current setup:
My.cnf settings:
# TokuDB #
tokudb_cache_size = 2G
tokudb_commit_sync = 0
tokudb_fsync_log_period = 1000
TokuDB table layout:
CREATE TABLE `t1` (
`id` int(15) NOT NULL AUTO_INCREMENT,
`m_id` int(11) NOT NULL,
`c1` decimal(6,2) DEFAULT NULL,
`c2` decimal(6,2) DEFAULT NULL,
`c3` decimal(6,2) DEFAULT NULL,
`c4` decimal(6,2) DEFAULT NULL,
`c5` decimal(6,2) DEFAULT NULL,
`c6` decimal(6,2) DEFAULT NULL,
`c7` decimal(6,2) DEFAULT NULL,
`factor` decimal(4,2) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1
CREATE TABLE `t2` (
`id` int(15) NOT NULL AUTO_INCREMENT,
`v_id` int(15) NOT NULL,
`pid` int(11) DEFAULT NULL,
`amount` decimal(6,2) DEFAULT NULL,
`unit` int(1) DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1
I'm aware that I'm not using any indexes other than the primary key. This is because of the negative impact secondary keys would have on insert speed. A clustering key will be created on each table once the insert job has finished.
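For reference, adding such a clustering key afterwards would look roughly like the sketch below, using TokuDB's CLUSTERING keyword; the index name and column choice here are only placeholders:

-- Hypothetical example: add a clustering secondary index on t2.v_id
-- once the bulk load has finished (index name and column are illustrative).
ALTER TABLE t2 ADD CLUSTERING KEY idx_v_id (v_id);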
Additional MySQL commandline option:
SET unique_checks=OFF;
Somehow I'm not able to set this in my.cnf. If anyone knows how, that would be greatly appreciated (currently unique_checks = off blocks MySQL from starting because of an unknown variable in my.cnf).
The SQL statements are grouped in batches of 15,000. A PHP script generates the statements and sends each batch to the MySQL server via mysqli's multi_query:
<?php
foreach (generateCombinations($Arr) as $c) {
    // Parent row for t1; the auto-increment id is captured for the child rows.
    $QueryBatch[] = "insert into t1 values (NULL"
        . ", " . $record->id
        . ", " . rand(1, 35)
        . ", " . rand(1, 140)
        . ", " . rand(1, 20)
        . ", NULL"
        . ", " . rand(1, 14)
        . ", " . rand(1, 300)
        . ", " . rand(1, 4)
        . ", NULL );";
    $QueryBatch[] = "SET @t1id = LAST_INSERT_ID();";
    $cntBatch++;

    // Child rows for t2, inserted as one multi-row statement.
    $pquery = array();
    foreach ($c as $key => $pid) {
        if (is_null($pid)) {
            continue;
        }
        $pquery[] = "(NULL, @t1id, " . $pid . ", " . rand(1, 800) . ", 0)";
        $cntBatch++;
    }
    $QueryBatch[] = "insert into t2 values " . implode(',', $pquery) . ";";

    // Flush the batch once it exceeds 15,000 statements.
    if ($cntBatch > 15000) {
        $query = implode($QueryBatch);
        if ($mysqli->multi_query($query)) {
            while ($mysqli->next_result()) {;}
        } else {
            printf("Errormessage: %s\n", $mysqli->error);
            echo $query . "\n";
        }
        $cntBatch = 0;
        unset($QueryBatch);
    }
}
?>
Example of SQL insert statement:
insert into t1 values (NULL, 1 , 30, 100, 15, NULL, 10, 250, 2, NULL );
SET @t1id = LAST_INSERT_ID();
insert into t2 values (NULL, @t1id, 1, 750, 0),(NULL, @t1id, 1, 600, 0),(NULL, @t1id, 1, 500, 0),(NULL, @t1id, 1, 400, 0),(NULL, @t1id, 1, 300, 0),(NULL, @t1id, 1, 200, 0),(NULL, @t1id, 1, 100, 0);
insert into t1 values (NULL, 2 , 25, 95, 14, NULL, 11, 200, 3, NULL );
SET @t1id = LAST_INSERT_ID();
insert into t2 values (NULL, @t1id, 1, 600, 0),(NULL, @t1id, 1, 100, 0),(NULL, @t1id, 1, 300, 0),(NULL, @t1id, 1, 443, 0),(NULL, @t1id, 1, 521, 0),(NULL, @t1id, 1, 213, 0),(NULL, @t1id, 1, 433, 0);
[.. At least 14982 more..]
If it were me, I would cut down the number of statements being executed, and cut down the number of commits. I'm assuming that AUTO_COMMIT is enabled, given that we don't see any START TRANSACTION or COMMIT statements.
That's a whole bloatload of individual INSERT and SET statements. At least the inserts into the child table use a multi-row insert rather than a separate insert statement for each row.
If I needed this to be fast, I would:
- generate the id values for the t1 table myself, and include those in the INSERT statements
- do away with the call to LAST_INSERT_ID()
- insert multiple rows into t1 per statement (rather than a separate INSERT statement for each row)
- use explicit START TRANSACTION and COMMIT
- run a single process to perform the inserts into t1 (serialize), to avoid potential contention for locks
If it were InnoDB, I'd also do SET FOREIGN_KEY_CHECKS=0.
There's already a boatload of calls to the rand function in the code, so incrementing an integer id for t1 isn't going to move the needle. When we start, we'd need a query to get the current AUTO_INCREMENT value, or get the MAX(id), whichever...
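For example, a minimal way to seed that counter before the run (either query works against the t1 table defined above):

SELECT AUTO_INCREMENT
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME = 't1';

-- or derive it from the highest existing id
SELECT COALESCE(MAX(id), 0) + 1 AS next_id FROM t1;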
Basically, I'd cut down the number of statements being executed, and get more work done by each statement, and more work done before each COMMIT.
Inserting ten (10) t1 rows per statement would significantly reduce the number of statements that need to be executed.
START TRANSACTION;
-- insert ten rows into t1
INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES
(444055501, 1 , 30, 100, 15, NULL, 10, 250, 2, NULL )
,(444055502, 2 , 25, 95, 14, NULL, 11, 200, 3, NULL )
, ...
,(444055510, 10 , 7, 45, 12, NULL, 10, 300, 4, NULL )
;
-- batch together the t2 rows associated with the ten t1 rows we just inserted
INSERT INTO t2 VALUES
-- 444055501
(NULL, 444055501, 1, 750, 0)
,(NULL, 444055501, 1, 600, 0)
,(NULL, 444055501, 1, 500, 0)
,(NULL, 444055501, 1, 400, 0)
,(NULL, 444055501, 1, 300, 0)
,(NULL, 444055501, 1, 200, 0)
,(NULL, 444055501, 1, 100, 0)
-- 444055502
,(NULL, 444055502, 1, 600, 0)
,(NULL, 444055502, 1, 100, 0)
,(NULL, 444055502, 1, 300, 0)
,(NULL, 444055502, 1, 443, 0)
,(NULL, 444055502, 1, 521, 0)
,(NULL, 444055502, 1, 213, 0)
,(NULL, 444055502, 1, 433, 0)
-- 444055503
, ...
;
-- another ten rows into t1
INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES
(444055511, 11 , 27, 94, 15, NULL, 10, 250, 11, NULL )
,(444055512, 12 , 24, 93, 14, NULL, 11, 200, 12, NULL )
, ...
,(444055520, 10 , 7, 45, 12, NULL, 10, 300, 4, NULL )
;
INSERT INTO t2 VALUES
(NULL, 444055511, 1, 820, 0)
,(NULL, 444055511, 1, 480, 0)
, ...
;
-- repeat INSERTs into t1 and t2, and after 1000 loops
-- i.e. 10,000 t1 rows, do a commit
COMMIT;
START TRANSACTION;
INSERT INTO t1 ...
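Translated back to the PHP side, a rough sketch of that restructuring might look like this. It assumes $mysqli, $Arr, $record and generateCombinations() from the original script, and that $nextId has been seeded as shown earlier; error handling is omitted for brevity.

<?php
// Sketch only: client-side ids, multi-row INSERTs into t1, explicit transactions.
$t1Rows  = array();
$t2Rows  = array();
$batches = 0;
$mysqli->begin_transaction();

foreach (generateCombinations($Arr) as $c) {
    $id = $nextId++;   // id assigned client-side, no LAST_INSERT_ID() needed
    $t1Rows[] = "($id, {$record->id}, " . rand(1, 35) . ", " . rand(1, 140) . ", "
              . rand(1, 20) . ", NULL, " . rand(1, 14) . ", " . rand(1, 300) . ", "
              . rand(1, 4) . ", NULL)";
    foreach ($c as $pid) {
        if (!is_null($pid)) {
            $t2Rows[] = "(NULL, $id, $pid, " . rand(1, 800) . ", 0)";
        }
    }

    // Every 10 parent rows, send two multi-row INSERTs instead of the ~30
    // single statements the original loop would have produced.
    if (count($t1Rows) >= 10) {
        $mysqli->query("INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES " . implode(',', $t1Rows));
        if ($t2Rows) {
            $mysqli->query("INSERT INTO t2 VALUES " . implode(',', $t2Rows));
        }
        $t1Rows = array();
        $t2Rows = array();

        // Commit every 1000 batches (10,000 t1 rows), matching the example above.
        if (++$batches % 1000 == 0) {
            $mysqli->commit();
            $mysqli->begin_transaction();
        }
    }
}

// Flush whatever is left, then do the final commit.
if ($t1Rows) {
    $mysqli->query("INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES " . implode(',', $t1Rows));
}
if ($t2Rows) {
    $mysqli->query("INSERT INTO t2 VALUES " . implode(',', $t2Rows));
}
$mysqli->commit();
?>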
LOAD DATA INFILE
Any discussion of the performance of inserts would be incomplete without at least mentioning LOAD DATA INFILE.
For best performance, that can't be beat. But since we don't have the data in a file, we don't have the key values (needed for the foreign key in t2), and we've got all the calls to rand to generate the data, LOAD DATA INFILE doesn't seem to be a good fit.
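For reference only, a typical invocation looks roughly like this; it's a hypothetical sketch that assumes the rows (with ids already assigned) had first been written out to a tab-delimited file:

-- Hypothetical sketch: file path and delimiters are placeholders.
LOAD DATA INFILE '/tmp/t1_rows.tsv'
INTO TABLE t1
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(id, m_id, c1, c2, c3, c4, c5, c6, c7, factor);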