What is the fastest way to load data from flatfiles into a MySQL database, and then create the relations between the tables via foreign keys?
For example... I have a flat file in the format:
[INDIVIDUAL] [POP] [MARKER] [GENOTYPE]
"INDIVIDUAL1", "CEU", "rs55555","AA"
"INDIVIDUAL1", "CEU", "rs535454","GA"
"INDIVIDUAL1", "CEU", "rs555566","AT"
"INDIVIDUAL1", "CEU", "rs12345","TT"
...
"INDIVIDUAL2", "JPT", "rs55555","AT"
Which I need to load into four tables:
IND (id,fk_pop,name)
POP (id,population)
MARKER (id,rsid)
GENOTYPE (id,fk_ind,fk_rsid,call)
Specifically, how does one populate the foreign keys in a way that scales? The figures are in the range of 1000+ individuals, each with 1 million+ genotypes.
There is a simpler way.
First, make sure you have a UNIQUE constraint on those columns that should have one (name, population, rsid).
Then use something like the following:
LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE POP FIELDS TERMINATED BY ','
ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES
(@name, population, @rsid, @call);
LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE MARKER FIELDS TERMINATED BY ','
ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES
(@name, @population, rsid, @call);
LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE IND FIELDS TERMINATED BY ','
ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES
(name, @population, @rsid, @call)
SET fk_pop = (SELECT id FROM POP WHERE population = @population);
LOAD DATA INFILE 'data.txt' IGNORE INTO TABLE GENOTYPE FIELDS TERMINATED BY ','
ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES
(@name, @population, @rsid, call)
SET fk_ind = (SELECT id FROM IND where name = @name),
fk_rsid = (SELECT id FROM MARKER where rsid = @rsid);
Note where the @ is used to indicate variables, rather than column names. In the first 2 LOAD DATAs, these are just used to ignore data. In the second 2, they are used to look up the foreign keys.
Might not be very fast, mind :).