I have this table with more than 7 million rows, and I am LOAD DATA LOCAL INFILE'ing more data into it in chunks of about 0.5 million rows at a time. The first few loads were fast, but each addition is taking increasingly long, probably due to indexing overhead:
CREATE TABLE `orthograph_ests` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `digest` char(32) NOT NULL,
  `taxid` int(10) unsigned NOT NULL,
  `date` int(10) unsigned DEFAULT NULL,
  `header` varchar(255) NOT NULL,
  `sequence` mediumblob,
  PRIMARY KEY (`id`),
  UNIQUE KEY `digest` (`digest`),
  KEY `taxid` (`taxid`),
  KEY `header` (`header`)
) ENGINE=InnoDB AUTO_INCREMENT=12134266 DEFAULT CHARSET=latin1
I am developing an application that will run on pre-existing databases. I most likely have no control over server variables unless I make changes to them mandatory (which I would prefer not to), so I'm afraid suggestions that rely on tuning server variables are of limited use to me.
I have read that minimizing the keys on this table would help. However, I need those keys for later queries. My guess is that dropping and re-creating them would take very long as well, but I have not tested this. I have also read that the UNIQUE constraint in particular slows down insertion. The digest column will hold SHA-256 digests that must be unique, and I cannot make sure there is no collision (very unlikely, I know, but possible).
Would partitioning help, as suggested here? Could I improve the indexing, e.g., by limiting the key length on the digest column? Should I change to MyISAM, which supports DISABLE KEYS during transactions? What else could I do to improve LOAD DATA performance?
Edit:
After the large insertion, this table is used for SELECTs only, with no more writes. This large loading is mostly a once-and-done operation; however, about 1,000 datasets (of 0.5M rows each) need to be uploaded before it is finished.
I will be using the digest to look up rows, which is why I indexed that column. If there should be a collision, that individual row should not be uploaded.
Putting the sequence blob in an external file system is probably not a viable option, since I cannot easily impose file system changes on the users.
This is indeed a large amount of data you are loading, and you should expect it to take many dozens of hours, especially on general purpose shared server hardware. There's very little magic (unless you work at Google or something) that will make this job anything but a big pain in the neck. So have courage.
It's a reference table. That means you should immediately switch to MyISAM and stay there for this table. You don't need InnoDB's transactional integrity features, but you do need MyISAM to disable indexing during loading and re-enable it afterward. Re-enabling indexing will take a long time, so be prepared for that.
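In case it helps, here's a rough sketch of that load pattern (the file path and any FIELDS/LINES clauses are placeholders you'd fill in for your own files; note that DISABLE KEYS only suspends maintenance of the non-unique indexes, so the primary key and the unique digest index are still updated during the load):

ALTER TABLE `orthograph_ests` ENGINE=MyISAM;

ALTER TABLE `orthograph_ests` DISABLE KEYS;  -- stops maintenance of the non-unique indexes (taxid, header)

LOAD DATA LOCAL INFILE '/path/to/chunk.tsv'  -- placeholder path; add FIELDS/LINES clauses to match your files
  INTO TABLE `orthograph_ests`;

ALTER TABLE `orthograph_ests` ENABLE KEYS;   -- rebuilds the non-unique indexes in one pass; this is the slow part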
You should consider using a shorter hash than SHA-256. SHA-1 (160 bits) is good. Believe it or not, MD5 (128 bits) may also serve. MD5 has been cracked, so it's not suitable for secure content authentication, but it's still a useful hash. A shorter hash is a better hash from your perspective.
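For what it's worth, hex-encoded digests have fixed widths (MD5 is 32 characters, SHA-1 is 40, SHA-256 is 64), so if you did switch hashes you could size the column to match. A sketch, assuming hex storage:

ALTER TABLE `orthograph_ests` MODIFY `digest` char(40) NOT NULL;  -- SHA-1 in hex
-- or binary(20) / binary(16) if you store raw SHA-1 / MD5 bytes instead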
If you can disable indexing MyISAM style, it probably doesn't matter much whether your digest key is unique. But you might consider allowing it to be non-unique to save time.
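Making it non-unique would be a one-statement change, assuming you're willing to check for duplicates yourself afterward:

ALTER TABLE `orthograph_ests` DROP INDEX `digest`, ADD INDEX `digest` (`digest`);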
It's hard to make a suggestion about partitioning without knowing more about your data and your server hardware. But considering this is a reference database, it seems like it might be wise just to bite the bullet for a couple of weeks and get it loaded.
If you have plenty of server disk space, you might consider loading each half-megarow chunk into its own table, then inserting it into the big table. That might prove to be a good way to deal with the possibility that you might have to reload the whole thing some day.
On shared server hardware, it might make sense to use smaller chunks than half a megarow.
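A sketch of the per-chunk staging idea (the staging table name and file path are placeholders; INSERT IGNORE is one way to silently skip any row whose digest collides with one already in the big table):

CREATE TABLE `ests_chunk` LIKE `orthograph_ests`;  -- staging table with the same structure

LOAD DATA LOCAL INFILE '/path/to/chunk.tsv'        -- placeholder path and format
  INTO TABLE `ests_chunk`;

INSERT IGNORE INTO `orthograph_ests` (`digest`, `taxid`, `date`, `header`, `sequence`)
SELECT `digest`, `taxid`, `date`, `header`, `sequence`
FROM `ests_chunk`;                                 -- leave out `id` so the big table assigns fresh ones

DROP TABLE `ests_chunk`;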
You might consider making a separate id / digest table. Then you could load your data without the digests and get it done quickly. Then you could write yourself a stored procedure or client that would create the digests in batches of a few thousand rows each until they were done. This only works if the stuff being digested is in your dataset.
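A rough sketch of that idea, assuming the digest is computed over the sequence column (adjust the expression if it covers something else) and that your MySQL is 5.5+ so SHA2() is available; the companion table name and batch size are made up:

CREATE TABLE `orthograph_est_digests` (
  `id` bigint(20) unsigned NOT NULL,
  `digest` char(64) NOT NULL,  -- SHA-256 in hex
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

-- Run this repeatedly (from a client loop or a stored procedure) until it
-- inserts zero rows; each pass digests a few thousand not-yet-done rows.
INSERT INTO `orthograph_est_digests` (`id`, `digest`)
SELECT e.`id`, SHA2(e.`sequence`, 256)
FROM `orthograph_ests` e
LEFT JOIN `orthograph_est_digests` d ON d.`id` = e.`id`
WHERE d.`id` IS NULL
LIMIT 5000;

-- Index the digests once, at the end; a genuine collision would surface here.
ALTER TABLE `orthograph_est_digests` ADD UNIQUE KEY `digest` (`digest`);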