php libxml2

libxml: parse all content when there are multiple root nodes


I know. It's not valid XML.

Unfortunately, it's part of a workaround for a bug in the PHP source code, which uses libxml.

PHP's loadHTML function accidentally overwrites the no-warning and no-error flags (LIBXML_NOWARNING and LIBXML_NOERROR), so if you pass those options, they never make it to libxml.

PHP's loadXML does not make the same mistake; all flags work as expected. So I'm looking into using loadXML as a substitute for now. Unfortunately, loadXML is not good for loading, say, template snippets or widgets, because it stops parsing after a single root node. So something like:

 <!--My title snippet -->
 <h1>${{ title }}</h1>
 <h4>${{ subtitle }}</h4>

will only be partially loaded with loadXML. Is there an option flag that forces libxml's parser to keep going? Or will I have to require that all snippets be wrapped in a root node?
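If no such flag turns up, the root-node fallback mentioned above might look something like this sketch. The `loadSnippet` helper and the `<root>` wrapper name are hypothetical, chosen only for illustration, and the sketch assumes the snippet is otherwise well-formed XML with no XML declaration of its own.

```php
<?php
// Hypothetical helper: wraps a multi-root snippet in a throwaway <root>
// element so that DOMDocument::loadXML() sees a single document element.
function loadSnippet(string $snippet): DOMDocument
{
    $doc = new DOMDocument();
    // Pass the flags that loadXML reportedly honours.
    $ok = $doc->loadXML(
        '<root>' . $snippet . '</root>',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );
    if ($ok === false) {
        throw new RuntimeException('Snippet is not well-formed XML');
    }
    return $doc;
}

$doc = loadSnippet(
    '<!--My title snippet --><h1>${{ title }}</h1><h4>${{ subtitle }}</h4>'
);

// The snippet's top-level nodes are the children of the wrapper element.
foreach ($doc->documentElement->childNodes as $node) {
    echo $node->nodeName, "\n";
}
```

The wrapper only exists inside the helper, so snippet authors would not need to add a root node themselves; callers just have to remember to work with the wrapper's children rather than the document element.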

Note

I have explored other ways of getting around the bug, for example by calling libxml_use_internal_errors(true) or by catching and clearing warnings with output buffering. Both work, but neither is satisfactory, since both still store warnings and errors in memory that I don't want.
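For reference, the libxml_use_internal_errors approach described in this note looks roughly like the sketch below. The markup passed to loadHTML is made up for illustration, and whether it produces any parser errors at all depends on the libxml2 version in use.

```php
<?php
// Collect libxml errors in an internal buffer instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Illustrative input only; <widget> is not a standard HTML element.
$doc->loadHTML('<h1>Title</h1><widget>custom</widget>');

// The errors still exist in memory at this point, which is the drawback
// described above.
$errors = libxml_get_errors();

// Drop the buffered errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo count($errors), " error(s) were buffered\n";
```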


Solution

  • The corresponding libxml2 function is xmlParseBalancedChunkMemory. The only place I could find where this function is exposed indirectly by the PHP API is DOMDocumentFragment::appendXML.

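    // Parse a chunk with several top-level nodes into a fragment owned by $doc.
    $doc = new DOMDocument();
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML('<h1>H1</h1><h4>H4</h4>');
    // Serialize only the fragment, not the whole document.
    print $doc->saveXML($fragment);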
    

    But if you're trying to parse HTML, you'll likely run into trouble, since appendXML uses the XML parser and expects well-formed XML.
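As a rough illustration of that caveat, the sketch below feeds appendXML one well-formed chunk and one HTML-style chunk with an unclosed `<br>`. The variable names are arbitrary, and it assumes libxml_use_internal_errors(true) is acceptable for keeping the parser messages from being raised as PHP warnings.

```php
<?php
// Buffer parser errors instead of emitting warnings (assumption: acceptable here).
libxml_use_internal_errors(true);

$doc = new DOMDocument();

// Well-formed XML: the void element is self-closed.
$xmlFragment = $doc->createDocumentFragment();
$okXml = $xmlFragment->appendXML('<p>line one<br/>line two</p>');

// HTML-style markup: <br> is never closed, so the chunk is not well-formed XML.
$htmlFragment = $doc->createDocumentFragment();
$okHtml = $htmlFragment->appendXML('<p>line one<br>line two</p>');

// appendXML() returns a bool, so the result can be checked before using the fragment.
var_dump($okXml, $okHtml);

libxml_clear_errors();
```

Checking the return value before appending the fragment anywhere is one way to notice snippets that are valid HTML but not valid XML.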