Search code examples
mysqlload-data-infile

MySQL Load Data InFile; skip rows IF


I've spent a fair amount of time googling this but I can't seem to point myself in the right direction of exactly what I'm looking for. The problem with my .csv file is that, while the line terminator is ',,,,', some lines do not include this, so when I import the file it's fine until it gets to one of these, but then it treats it as one record that's about twice as long as the amount of columns a standard record should have, and then it's thrown off from that point forward. What I need to do is skip the records (data between ',,,,' terminations) that have more than the correct number of columns (15). I realize this will essentially skip 2 records each time this happens, but that's fine for the purpose of what I'm doing with a pretty large dataset.

I've come across the IGNORE keyword, but that doesn't seem to apply. What I'm looking for is something like: for each record during import, skip record if record.columns.count > 15. Here is my import statement, thanks for any help provided.

LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,';

Solution

  • If you just want to skip the malformed records, a simple awk command to filter only the good records is:

    awk -F, '{ if (NF == 15) print; }' csvData.csv > csvData_fixed.csv
    

    Then LOAD DATA from the fixed file.

    If you want to get fancier, you could write a script using awk (or Python or whatever you like) to rewrite the malformed records in the proper format.


    Re your comment: The awk command reads your original file and outputs only each line that has exactly 15 fields, where fields are separated by commas.

    Apparently your input data has no lines that have exactly 15 fields, even though you described it that way.


    Another thought: it's a little bit weird to use the line terminator of ',,,,' in your original LOAD DATA command. Normally the line terminator is '\n' which is a newline character. So when you redefine the line terminator as ',,,,' it means MySQL will keep reading text until it finds ',,,,' even if that ends up reading dozens of fields over multiple lines of text. Perhaps you could set your line terminator to ',,,,\n'.