Search code examples
mysqlcsvload-data-infile

Import huge data set csv in MYSQL


I'm trying to import a huge data set from a csv file ~400MB with 900000 rows. This file has the information of two relational tables. For example:

["primary_key","name","lastname","phone,"work_id","work_name"]

Every row i have to check if the primary key exists for insert or updated if needed, also i need to verify work, because new works can appear in this dataset.

My person table has more colummns that the csv file has, so i can't replace the line with mysqlimport.

Any ideas on how to work with this?


Solution

  • Please read the documentation for LOAD DATA INFILE; it is a good choice for loading data, even very big files. Quoting from Reference manual: Speed of insert statements:

    When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements

    Assuming that your table as more columns than the .csv file, then you'd have to write something like this:

    load data local infile 'path/to/your/file.csv'
    into table yourTable
    fields terminated by ',' optionally enclosed by '"' lines terminated by '\n'
    ignore 1 lines -- if it has column headers
    (col1, col2, col3, ...) -- The matching column list goes here
    

    See my own question on the subject and its answer.

    Also, if you need faster inserts, you can:

    • Ignore foreign key constraints, with SET foreign_key_checks = 0; before executing load data, and/or
    • Disable the indexes of the table with alter table yourTable disable keys; before executing load data, and rebuilding them afterwards with alter table yourTable enable keys;

    Untested: If your .csv file has more columns than your table, I think that you can assign the "exceeding" columns in the file to temp variables:

    load data local infile 'path/to/your/file.csv'
    into table yourTable
    fields terminated by ',' optionally enclosed by '"' lines terminated by '\n'
    ignore 1 lines -- if it has column headers
    (col1, col2, col3, @dummyVar1, @dummyVar2, col4) -- The '@dummyVarX` variables
                                                     -- are simply place-holders for
                                                     -- columns in the .csv file that
                                                     -- don't match the columns in 
                                                     -- your table