Search code examples
mysqlperformancenormalizationload-data-infile

Loading flatfiles into a normalized MySQL database


What is the fastest way to load data from flatfiles into a MySQL database, and then create the relations between the tables via foreign keys?

For example... I have a flat file in the format:

[INDIVIDUAL]   [POP]  [MARKER]  [GENOTYPE]

"INDIVIDUAL1", "CEU", "rs55555","AA"  
"INDIVIDUAL1", "CEU", "rs535454","GA"  
"INDIVIDUAL1", "CEU", "rs555566","AT"  
"INDIVIDUAL1", "CEU", "rs12345","TT"  
...  
"INDIVIDUAL2", "JPT", "rs55555","AT"  

Which I need to load into four tables:

IND (id,fk_pop,name)  
POP (id,population)  
MARKER (id,rsid)  
GENOTYPE (id,fk_ind,fk_rsid,call)  

Specifically, how does one populate the foreign keys in a way that scales? The figures are in the range of 1000+ individuals, each with 1 million+ genotypes.


Solution

  • There is a simpler way.

    First, make sure you have a UNIQUE constraint on those columns that should have one (name, population, rsid).

    Then use something like the following:

     LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE POP FIELDS TERMINATED BY ','
        ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
        (@name, population, @rsid, @call);
     LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE MARKER FIELDS TERMINATED BY ',' 
        ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
        (@name, @population, rsid, @call);
     LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE IND FIELDS TERMINATED BY ',' 
        ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
        (name, @population, @rsid, @call) 
        SET fk_pop = (SELECT id FROM POP WHERE population = @population);
     LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE GENOTYPE FIELDS TERMINATED BY ',' 
        ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES 
        (@name, @population, @rsid, call)
        SET fk_ind = (SELECT id FROM IND where name = @name),
        fk_rsid = (SELECT id FROM MARKER where rsid = @rsid);
    

    Note where the @ is used to indicate variables, rather than column names. In the first 2 LOAD DATAs, these are just used to ignore data. In the second 2, they are used to look up the foreign keys.

    Might not be very fast, mind :).