
PHP - Processing Invalid XML


I'm using SimpleXML to load some XML files (which I didn't write or provide, and whose format I can't really change).

Occasionally (e.g. one or two files out of every 50 or so) they don't escape special characters (mostly &, but sometimes other random invalid things too). This creates an issue because SimpleXML in PHP just fails, and I don't really know of any good way to handle parsing invalid XML.

My first idea was to preprocess the XML as a string and put ALL fields in as CDATA so it would work, but for some ungodly reason the XML I need to process puts all of its data in the attribute fields. Thus I can't use the CDATA idea. An example of the XML being:

 <Author v="By Someone & Someone" />

What's the best way to process this to replace all the invalid characters in the XML before I load it with SimpleXML?
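
To make the failure concrete, here is a minimal reproduction sketch. It assumes the sample attribute above is the entire document; the variable names and the output format are illustrative only, and the exact error messages depend on the installed libxml version.

```php
<?php
// Hypothetical sample mirroring the attribute shown above.
$xml = '<Author v="By Someone & Someone" />';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

$sxe = simplexml_load_string($xml);

if ($sxe === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError carries a message plus the reported line and column.
        printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
    }
}

// Leave libxml's error state as we found it.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```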


Solution

  • What you need is something that will use libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.

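    function load_invalid_xml($xml)
    {
        // Collect parser errors internally instead of emitting warnings
        $use_internal_errors = libxml_use_internal_errors(true);
        libxml_clear_errors();
    
        $sxe    = simplexml_load_string($xml);
        $errors = libxml_get_errors();
        libxml_clear_errors();
    
        // Compare against false: a valid element can still evaluate as falsy
        if ($sxe !== false)
        {
            libxml_use_internal_errors($use_internal_errors);
            return $sxe;
        }
    
        $fixed_xml = '';
        $last_pos  = 0;
    
        foreach ($errors as $error)
        {
            // $pos is the byte offset of the faulty character;
            // compute_position() is not built in, you have to write it yourself
            $pos = compute_position($xml, $error->line, $error->column);
            if ($pos < $last_pos)
            {
                continue; // this error points into text that was already copied
            }
            $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
            $last_pos = $pos + 1;
        }
        $fixed_xml .= substr($xml, $last_pos);
    
        // Parse the repaired string, then restore the caller's error handling
        $sxe = simplexml_load_string($fixed_xml);
        libxml_clear_errors();
        libxml_use_internal_errors($use_internal_errors);
    
        return $sxe;
    }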
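
The mockup leaves compute_position() undefined. One possible sketch is below; it is a hypothetical helper, not part of libxml or SimpleXML. It takes the XML string as its first argument, because a line and column cannot be turned into an offset without the text. It assumes "\n" line endings, 1-based line numbers, and a column counted in bytes. The PHP manual notes that the column property is often reported as 0, and the reported position may not sit exactly on the offending character, so inspect the real errors for your files and adjust before relying on it.

```php
<?php
/**
 * Hypothetical helper: convert a 1-based (line, column) pair, as found on a
 * LibXMLError, into a 0-based byte offset into $xml.
 * Assumes "\n" line endings and that the column counts bytes.
 */
function compute_position(string $xml, int $line, int $column): int
{
    $offset = 0;

    // Advance $offset to the first byte of the requested line.
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $offset);
        if ($newline === false) {
            break; // fewer lines than reported: stay on the last line found
        }
        $offset = $newline + 1;
    }

    // A column of 0 (not populated) is treated like column 1.
    $pos = $offset + max(0, $column - 1);

    // Clamp so the result is a valid index for any non-empty string.
    return min($pos, max(0, strlen($xml) - 1));
}

// Usage: show where each reported error lands in a sample document.
$xml = "<Root>\n <Author v=\"By Someone & Someone\" />\n</Root>";

$previous = libxml_use_internal_errors(true);
libxml_clear_errors();
simplexml_load_string($xml);

foreach (libxml_get_errors() as $error) {
    $pos = compute_position($xml, $error->line, $error->column);
    printf("%s -> offset %d, character '%s'\n", trim($error->message), $pos, $xml[$pos]);
}

libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Printing the character at each computed offset is a quick way to check whether the reported column needs a fixed adjustment (for example, stepping back to the nearest preceding &) before the offset is fed into load_invalid_xml().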