mysql · optimization · myisam · load-data-infile

MySQL Optimization for LOAD DATA INFILE


I see programmers everywhere discussing optimization for the fastest LOAD DATA INFILE inserts. But they never explain much about their choice of values, and optimization depends on the environment and on the actual real needs.

So I would like some explanation of the best values to put in my MySQL config file to reach the fastest insert possible, please.

My config: an Intel dual-core @ 3.30 GHz, 4 GB DDR4 RAM (Windows 7 says "2.16 GB available" though, because of reserved memory).

My backup.csv file is plain text with about 5 billion entries, so it is a huge file of about 500 GB, with lines shaped like this (but with hexadecimal strings of length 64):

 "sdlfkjdlfkjslfjsdlfkjslrtrtykdjf";"dlksfjdrtyrylkfjlskjfssdlkfjslsdkjf"

There are only two fields in my table, and the first one is a UNIQUE index. ROW_FORMAT is set to FIXED for space-saving reasons, and for the same reason both fields are typed as BINARY(32).

I'm using the MyISAM engine (because InnoDB requires much more space!). (MySQL version 5.1.41.)
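For reference, the table looks something like this (the column names hash and verif are the ones used in the LOAD DATA command below):

 CREATE TABLE verification (
     hash  BINARY(32) NOT NULL,   -- first field, UNIQUE index
     verif BINARY(32) NOT NULL,
     UNIQUE KEY hash (hash)
 ) ENGINE=MyISAM ROW_FORMAT=FIXED;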

Here is the code I plan to use for now:

 ALTER TABLE verification DISABLE KEYS;
 LOCK TABLES verification WRITE;
 LOAD DATA INFILE 'G:\\backup.csv'
      IGNORE INTO TABLE verification
      FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n'
      (@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
 UNLOCK TABLES;
 ALTER TABLE verification ENABLE KEYS;

As you can see, the LOAD DATA INFILE command reads the plain-text values and unpacks the hexadecimal strings into raw binary with UNHEX() (both fields are hexadecimal hashes in the end, so...).

I have heard about the buffer sizes etc., and all those values in the MySQL config file. What should I change, and what would be the best values, please? As you can see, I already lock the table and disable the keys to speed things up.
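For reference, these are the MyISAM-related buffer variables I keep seeing mentioned for bulk loads; the values below are just placeholders to show the syntax, not tuned recommendations for my machine:

 [mysqld]
 key_buffer_size           = 1024M
 bulk_insert_buffer_size   = 256M
 myisam_sort_buffer_size   = 1024M
 myisam_max_sort_file_size = 600G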

I also read in the documentation:

 myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName

Doing that before the insert is supposed to speed it up as well. But what exactly is tblName? (I have a .frm file, a .MYD and a .MYI, so which one am I supposed to point at?)
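(From what I read, myisamchk actually operates on the index file, so pointing it at the base name without extension, or at the .MYI file directly, should both be accepted; for example:)

 myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
 myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName.MYI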

Here are the last short hints I read about optimization.

EDIT: Forgot to mention, everything is running on localhost.


Solution

  • So, I finally managed to insert my 500 GB database of more than 3 billion entries in something like 5 hours.

    I tried many ways, and while rebuilding the primary index I kept getting stuck on this error: ERROR 1034 (HY000): Duplicate key 1 for record at 2229897540 against new record at 533925080.

    I will now explain how I managed to complete my insert:

    • I sorted my .csv file with GNU coreutils sort.exe (I'm on Windows). Keep in mind that to do this you need 1.5x your CSV file's size as free space for temporary files (so counting the .csv file itself, that makes 2.5x in total). See the sort command sketch after this list.
    • Create the table, with indexes and all.
    • Execute mysqladmin flush-tables -u a_db_user -p
    • Execute myisamchk --keys-used=0 -rq /var/lib/mysql/dbName/tblName
    • Insert the data (DO NOT USE ALTER TABLE tblName DISABLE KEYS; !!!):

      LOCK TABLES verification WRITE;
      LOAD DATA INFILE 'G:\\backup.csv'
          IGNORE INTO TABLE verification
          FIELDS TERMINATED BY ';'
          ENCLOSED BY '"'
          LINES TERMINATED BY '\r\n'
          (@myhash, @myverif) SET hash = UNHEX(@myhash), verif = UNHEX(@myverif);
      UNLOCK TABLES;
    • When the data is inserted, rebuild the indexes by executing myisamchk --key_buffer_size=1024M --sort_buffer_size=1024M -rqq /var/lib/mysql/dbName/tblName (note the -rqq: doubling the q makes it ignore possible duplicate-key errors by trying to repair them, instead of just stopping after many hours of waiting!).

    • Execute mysqladmin flush-tables -u a_db_user -p

    And I was done!

    • I noticed a huge boost in speed when the .csv file is on a different drive than the database, and the same goes for the sort operation: put the temporary files on yet another drive. (Reads and writes are much faster when the data is not all on the same disk.)
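    For completeness, here is a sketch of the sort invocation from the first step (the file names and the temp directory are placeholders; -o sets the output file, and -T puts the temporary files on another drive, as suggested above):

      sort.exe -o G:\backup_sorted.csv -T E:\sort_tmp G:\backup.csv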

    The source of this, again, was here: credits to this solution.