PHP cannot parse CSV correctly (file is in UTF-16LE)


I am trying to parse a CSV file using PHP.
The file uses commas as delimiter and double quotes for fields containing comma(s), as:

foo,"bar, baz",foo2

The issue I am facing is that fields containing commas get split at those commas. I get:

  • "2
  • rue du ..."

Instead of: 2, rue du ....


Encoding:
The file doesn't seem to be in UTF-8. It has weird characters at the beginning (apparently not a BOM; they look like this when converted from ASCII to UTF-8: ÿþ) and doesn't display accents.

  • My code editor (Atom) reports the encoding as UTF-16 LE
  • mb_detect_encoding() on the CSV lines returns ASCII

But it fails to convert:

  • mb_convert_encoding() converts from ASCII, but returns Asian characters when converting from UTF-16LE
  • iconv() returns Notice: iconv(): Wrong charset, conversion from UTF-16LE/ASCII to UTF8 is not allowed.

Parsing:
I tried to parse with this one-liner (see those 2 comments) using str_getcsv():

$csv = array_map('str_getcsv', file($file['tmp_name']));

I then tried with fgetcsv():

$arr = [];
$f = fopen($file['tmp_name'], 'r');
while (($l = fgetcsv($f)) !== false) {
    $arr[] = $l;
}
fclose($f);

Either way, I get my address field in 2 parts. But when I try this code sample, the fields are parsed correctly:

$str = 'foo,"bar, baz",foo2,azerty,"ban, bal",doe';
$data = str_getcsv($str);
echo '<pre>' . print_r($data, true) . '</pre>';

To sum up with questions:

  • What are the characters at the beginning of the file?
  • How can I be sure about the encoding? (Atom reads the file as UTF-16 LE and doesn't display weird characters at the beginning)
  • What makes the CSV parsing functions fail?
  • If I should rely on something else to parse the lines of the CSV, what could I use?

Solution

  • I finally solved it myself:

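    I sent the file to online encoding-detection websites, which returned UTF-16LE. Reading up on UTF-16LE, I found that such files usually start with a BOM (byte order mark), which explains the ÿþ at the beginning: it is the byte pair FF FE displayed as Latin-1 characters.
    My previous attempts used file(), which returns an array of the file's lines, and fopen(), which returns a resource, but both still read the file line by line. In UTF-16LE every ASCII character is followed by a NUL byte, so these byte-oriented functions cut the 2-byte line feed in half and never see a quote directly after a comma.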
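
    A small reproduction of why the line-by-line attempts failed, as a sketch only: the sample rows are made up and re-encoded to UTF-16LE in memory to stand in for the uploaded file, and explode() on the "\n" byte is used to mimic how file() cuts lines.

```php
<?php
// Hypothetical sample rows, re-encoded to UTF-16LE to mimic the uploaded file.
$utf8  = "id,\"2, rue du Test\",foo\nid2,\"bar, baz\",foo2";
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

// Splitting on the "\n" byte cuts the 2-byte line feed (0A 00) in half,
// so the second line starts with a stray NUL byte and is misaligned.
$rawLines = explode("\n", $utf16);
var_dump($rawLines[1][0] === "\0"); // bool(true)

// The byte after each comma is NUL, not a quote, so the quoted field is
// not recognized as enclosed and is split at its inner comma.
$rawFields = str_getcsv($rawLines[0], ',', '"', '\\');
echo count($rawFields), "\n"; // 4 instead of 3

// Converting the whole string first restores one byte per ASCII character.
$lines  = explode("\n", mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE'));
$fields = str_getcsv($lines[0], ',', '"', '\\');
print_r($fields); // id / 2, rue du Test / foo
```

    The misaligned second line is also a plausible reason for the Asian characters: once a line starts on the wrong byte, every 2-byte unit decodes to an unrelated code point.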

    What finally worked was converting the whole file at once instead of converting each line separately. Here is a working solution:

    $f = file_get_contents($file['tmp_name']);          // Get the whole file as a string
    $f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');  // Convert the file to UTF-8
    $f = preg_split('/\R/u', $f);                       // Split on line breaks (/u so \R cannot match a 0x85 byte inside a UTF-8 character)
    $f = array_map('str_getcsv', $f);                   // Parse each line as CSV data
    

    The address fields are no longer split at internal commas.
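
    A slightly more defensive variant of the same idea, as a sketch rather than a definitive implementation. The helper name parseUtf16Csv() and the sample data are hypothetical. It assumes PHP 7.4+, an input that either starts with a UTF-16 BOM or is UTF-16LE without one, and no quoted field that contains a line break. It removes the BOM before converting so the BOM cannot end up attached to the first field.

```php
<?php
/**
 * Hypothetical helper: decode UTF-16 CSV bytes and parse them into rows.
 */
function parseUtf16Csv(string $bytes): array
{
    // Pick the byte order from the BOM (default: little endian) and drop it.
    $encoding = 'UTF-16LE';
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    } elseif (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        $encoding = 'UTF-16BE';
        $bytes = substr($bytes, 2);
    }

    // Convert the whole content at once, never line by line.
    $text = mb_convert_encoding($bytes, 'UTF-8', $encoding);

    // \R matches \r\n, \n and \r; PREG_SPLIT_NO_EMPTY drops blank lines,
    // including the empty string left after a trailing line break.
    $lines = preg_split('/\R/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(
        static fn(string $line): array => str_getcsv($line, ',', '"', '\\'),
        $lines
    );
}

// Usage with in-memory sample data standing in for the uploaded file.
$sample = "\xFF\xFE" . mb_convert_encoding(
    "id,address\r\n1,\"2, rue du Caf\u{E9}\"\r\n",
    'UTF-16LE',
    'UTF-8'
);
$rows = parseUtf16Csv($sample);
print_r($rows); // two rows: [id, address] and [1, "2, rue du Café"]
```

    Splitting on line breaks before parsing breaks quoted fields that span several lines. If the data can contain those, one option is to write the converted text to a php://temp stream and read it back with fgetcsv().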