Search code examples
phpexcelparsingcsvline-breaks

How can you parse excel CSV data that contains linebreaks in the data?


I'm attempting to parse a set of CSV data using PHP, but having a major issue. One of the fields is a long description field, which itself contains linebreaks within the enclosures.

My primary issue is writing a piece of code that can split the data line by line, but also recognize when linebreaks within the data should not be used. The linebreaks within this field are not properly escaped, making them hard to distinguish from legitimate linebreaks.

I've tried to come up with a regular expression that can properly handle it, but had no luck so far. Any ideas?

CSV format:

"####","text data here", "text data \n with linebreaks \n here"\n
"####","more text data", "more data \n with \n linebreaks \n here"\n

Solution

  • According to aleske, a commenter in the documentation for PHP's fgetcsv function:

    The PHP's CSV handling stuff is non-standard and contradicts with RFC4180, thus fgetcsv() cannot properly deal with files [that contain line breaks] ...

    And he offered up the following function to get around this limitation:

    function csvstring_to_array(&$string, $CSV_SEPARATOR = ';', $CSV_ENCLOSURE = '"', $CSV_LINEBREAK = "\n") { 
      $o = array(); 
    
      $cnt = strlen($string); 
      $esc = false; 
      $escesc = false; 
      $num = 0; 
      $i = 0; 
      while ($i < $cnt) { 
    $s = $string[$i]; 
    
    if ($s == $CSV_LINEBREAK) { 
      if ($esc) { 
        $o[$num] .= $s; 
      } else { 
        $i++; 
        break; 
      } 
    } elseif ($s == $CSV_SEPARATOR) { 
      if ($esc) { 
        $o[$num] .= $s; 
      } else { 
        $num++; 
        $esc = false; 
        $escesc = false; 
      } 
    } elseif ($s == $CSV_ENCLOSURE) { 
      if ($escesc) { 
        $o[$num] .= $CSV_ENCLOSURE; 
        $escesc = false; 
      } 
    
      if ($esc) { 
        $esc = false; 
        $escesc = true; 
      } else { 
        $esc = true; 
        $escesc = false; 
      } 
    } else { 
      if ($escesc) { 
        $o[$num] .= $CSV_ENCLOSURE; 
        $escesc = false; 
      } 
    
      $o[$num] .= $s; 
    } 
    
    $i++; 
      } 
    
    //  $string = substr($string, $i); 
    
      return $o; 
    } 
    

    That looks like it will do the trick.