Search code examples
phpjavascriptcsvutf-8ansi

Identify all non-standard special characters in an ANSI-encoded CSV


I have an ANSI-encoded CSV file that contains a number of 'problem' special characters. I'm looking for a script (preferably php or javascript) that I can use to check each record in the CSV and identify those that have problem characters.

I have no trouble looping through the CSV records, so I'm just looking for a good way to determine whether a a single string contains any characters that would cause problems if the string was inserted directly into a UTF-8 encoded file.

Background: I used a script to convert an ANSI CSV directly to UTF-8 XML without taking care to convert the CSV to UTF-8, first. Boneheaded move on my part. The script created XML entities for records with problem characters, but all textNodes into which the script tried to insert text with problem characters ended up empty. What I'm looking for, now, is a way to parse the original CSV file and identify all records containing problem characters. With ~18,000 records, it's not a job that I'd like to do manually :-)

Clarification I should have first converted the ANSI CSV to UTF-8, then run my 'convert to XML' script on the UTF-8 encoded CSV file. Instead, I skipped the first step and ran my 'convert to XML' script on the ANSI encoded CSV file. XML entities were created for all cells, but the XML entities for cells with characters such as — (em dash) and ½ (one half) were all empty. The 'convert to XML' script silently failed to insert these strings into the UTF-8 encoded XML document (using DOMDocument in PHP).


Solution

  • Folks, this is quick and dirty, but that's the kind of solution I needed in this situation. I used the following code to scan through the original CSV, looking at each character in each row. Any row with a characater with ord() > 127, I inserted into a second CSV. This new CSV file contained only the rows that had 'special' characters.

    In this particular case, my original CSV was larger than 5MB, and the new CSV containing only rows with special characters was much smaller, on the order of a couple hundred KB, which made it much easier to work with.

    $input_file  = fopen($input_filePath, 'rt');
    $output_file = fopen($output_filePath, 'w');
    
    // Get the column headers of the file
    $headers = fgetcsv($input_file);
    
    // Loop through each row
    while (($row = fgetcsv($input_file)) !== FALSE)
    {
      // Loop through each cell
      foreach ($headers as $i => $header)
      {
        $cell = $row[$i];
        // Loop through each char until we find a 'special' char
        // or reach the end of the cell, whichever comes first
        for ($j = 0; $j < strlen($cell); $j++)  {   
          if (ord(substr($cell, $j, 1)) > 127) {
            // If we find a special char, add this row to the new CSV file
            fputcsv($output_file, $row);
            break;
          }
        }
      }
    }