Search code examples
mysqlcsvescapingload-data-infile

Prevent LOAD DATA INFILE from escaping double double quotes


I have csv data like the following:

"E12 98003";1085894;"HELLA";"8GS007949261";"";1
"5 3/4"";652493;"HELLA";"9HD140976001";"";1

Some fields are included in double quotes. The problem is that as you may see in the second line the data in the first column contains a double quotation mark at the end as part of the data.

I tried something along the lines of:

LOAD DATA INFILE file.csv
INTO TABLE mytable
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'

but it will use the quotation mark in the data to escape the field enclosing quotation mark. I also tried ESCAPED BY '' and ESCAPED BY '\\' with no success. Is there a way to stop the LOAD DATA INFILE command from escaping the double double quotation marks? Or should I parse the csv and put double quotation marks when there is only one? I am parsing the files anyway using powershell to change the encoding to utf8. Is there some way to fix this quickly there? My powershell code:

function Convert-FileToUTF8 {

    param([string]$infile,
          [string]$outfile,
          [System.Int32]$encodingCode)

    $encoding = [System.Text.Encoding]::GetEncoding($encodingCode)
    $text = [System.IO.File]::ReadAllText($infile, $encoding)
    [System.IO.File]::WriteAllText($outfile, $text)

}

Ok, I did it using a .NET regular expression to fix the csv. It is costly, but not too much. I wrote

$text = [regex]::Replace($text, "(?m)(?<!^)(?<!\;)""(?!\;)(?!\r?$)", '""');

just before the last line in the function and it seems to work ok. Since I am a novice in regular expressions this could probably be improved.


Solution

  • The main problem is that the input data constitutes invalid CSV syntax, as stated in RFC-4180, paragraph 7:

    If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

    But in your PowerShell script you could try to fix this issue with an extra line, using the replace method on $text, once you got it's value:

    $text = $text.Replace('"";', '""";')
    

    This should be enough, as the loader will deal well with unescaped double quotes if they appear elsewhere in the data, as stated on mysql.com (my highlight):

    If the field begins with the ENCLOSED BY character, instances of that character are recognized as terminating a field value only if followed by the field or line TERMINATED BY sequence.

    Of course, if the badly formatted CSV has data that contains ";, then you still have a problem. But it is very hard to determine whether such an occurrence terminates the data or should be seen as part of the data, even for humans :-)

    Another thing to pay attention to as found on mysql.com:

    If the input values are not necessarily enclosed within quotation marks, use OPTIONALLY before the ENCLOSED BY keywords.