php xml parsing xmlreader

PHP XMLReader stumbles upon invalid character and stops


As the title says.

I'm processing large downloaded XML files on the fly. Some of those files contain characters that are invalid in XML, such as "US" (unit separator, 0x1F) or "VT" (vertical tab, 0x0B). I have no idea why those characters are there to begin with, and I can't fix the files at the source.

$z = new XMLReader;
$z->open('compress.zlib://'.$file, "UTF-8");

// Skip ahead to the first <p> element
while ($z->read() && $z->name !== 'p');

while ($z->name === 'p') {
    try {
        // Throws if the string cannot be parsed as XML
        $node = new SimpleXMLElement($z->readOuterXML());
    } catch (Exception $e) {
        echo $e->getMessage();
    }
    // And so on
}

I get an error saying "String could not be parsed as XML".

What can I do here?


Solution

  • Ended up finding a solution after all.

    I decided to read the file with fopen, build the XML string for each record on the fly, and strip the invalid characters before parsing it. Here's what I ended up with:

    $handle = fopen('compress.zlib://'.$file, 'r');
    $xml_source = '';
    $record = false;
    if ($handle) {
        while (($buffer = fgets($handle, 4096)) !== false) {
            // Start of a record: reset the buffer and begin recording
            if (strpos($buffer, '<open_tag>') !== false) {
                $xml_source = '<?xml version="1.0" encoding="UTF-8"?>';
                $record = true;
            }
            // End of a record: append the last line, then clean and parse it
            if (strpos($buffer, '</close_tag>') !== false) {
                $xml_source .= $buffer;
                $record = false;
                $xml = simplexml_load_string(stripInvalidXml($xml_source));

                // ... do stuff here with the xml element

            }
            // Inside a record: keep accumulating lines
            if ($record) {
                $xml_source .= $buffer;
            }
        }
        fclose($handle);
    }
    

    The function stripInvalidXml() is the one quickshiftin provided. Works like a charm.
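
The body of stripInvalidXml() isn't reproduced in this answer. For anyone landing here without it, below is a minimal sketch of what such a helper might look like. It is an assumption on my part, not necessarily quickshiftin's exact code: it simply drops every character outside the range XML 1.0 allows (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF), and falls back to removing raw control bytes if the input is not valid UTF-8.

```php
<?php
// Hypothetical reimplementation: the original stripInvalidXml() is not shown in this thread.
function stripInvalidXml(string $value): string
{
    // Remove every character that is not in the XML 1.0 "Char" range.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );

    if ($clean === null) {
        // preg_replace() returns null when the subject is not valid UTF-8.
        // Fall back to a byte-wise pass that drops the C0 control bytes
        // XML forbids (everything below 0x20 except tab, LF and CR).
        $clean = (string) preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $value);
    }

    return $clean;
}
```

With this in place, a string like "<p>a\x0Bb\x1Fc</p>" comes back as "<p>abc</p>" and simplexml_load_string() accepts it. Note that the characters are deleted, not replaced, so adjacent text is joined; swap the empty replacement for a space if that matters for your data.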