php xml validation xpath xhtml

Illegal self-closing notation for empty nodes when outputting XHTML with PHP DOMDocument


I am processing XML-compliant XHTML input using XPath in PHP like this:

$xml=new DOMDocument();
$xml->loadXML(utf8_encode($temp));
[...]
$temp=utf8_decode($xml->saveXML());

The problem is that elements which must not be self-closing according to the HTML5 spec, e.g.

<textarea id="something"></textarea>

or a div used as a hook for JavaScript

<div id="someDiv" class="whaever"></div>

come back out as

<textarea id="something" />

and

<div id="someDiv" class="whaever" />

I currently address this by using str_replace, but that's nonsense, as I need to match each case individually. How can I solve this?

At the same time, XPath insists on emitting

xmlns:default="http://www.w3.org/1999/xhtml"

and on freshly created nodes it emits prefixed names like <default:p>. How do I stop that without resorting to a stupid search and replace like this:

$temp=str_replace(' xmlns:default="http://www.w3.org/1999/xhtml" '," ",$temp);
$temp=str_replace(' xmlns:default="http://www.w3.org/1999/xhtml"'," ",$temp);
$temp=str_replace('<default:',"<",$temp);
$temp=str_replace('</default:',"</",$temp);

?

EDIT: I'm really running into trouble with the stupid search and replace, and I do not intend to attack the output XHTML with regular expressions. Consider this example:

<div id="videoPlayer0" class="videoPlayerPlacement" data-xml="video/cp_IV_a_1.xml"/>

Obviously self-closing divs are illegal (at least in one context, where I cannot serve the output with the MIME type application/xhtml+xml but am forced to use text/html), and in all other cases they certainly don't validate.


Solution

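  • It is possible to force non-void tags to be written with an explicit closing tag using a trick: give every empty non-void element a placeholder comment child, serialize the document, then strip the placeholders from the output. It is not an official solution, but it works.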

    function export_html(DOMDocument $dom)
    {
        $voids = [
            'area',
            'base',
            'br',
            'col',
            // 'colgroup' is deliberately not listed: it may contain <col> children, so it is not a void element.
            'command',
            'embed',
            'hr',
            'img',
            'input',
            'keygen',
            'link',
            'meta',
            'param',
            'source',
            'track',
            'wbr',
        ];
    
        // Every empty node; 
        // there is no reason to match nodes with content inside.
        $query = '//*[not(node())]';
        $nodes = (new DOMXPath($dom))->query($query);
    
        foreach ($nodes as $node) {
            if (in_array($node->nodeName, $voids)) {
                // A void tag.
                continue;
            }
            // Not a void tag. We inject a placeholder content.
            $node->appendChild(new DOMComment('NOT_VOID'));
        }
        
        // We remove the placeholders.
        return str_replace('<!--NOT_VOID-->', '', $dom->saveXML());
    }
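
For comparison, libxml does have a built-in switch, LIBXML_NOEMPTYTAG, but it expands every empty element, void ones included, which is presumably why a placeholder trick like the one above is needed. A minimal sketch (the sample markup and the variable names are made up for illustration):

```php
<?php
// Illustrative input: one non-void element (div) and one void element (br).
$dom = new DOMDocument();
$dom->loadXML('<html><div id="someDiv"/><br/></html>');

// LIBXML_NOEMPTYTAG expands *all* empty elements, so <br/> becomes <br></br>.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $expanded, PHP_EOL;
// <html><div id="someDiv"></div><br></br></html>
```

The div comes out as wanted, but <br></br> is not valid in text/html either, so the flag alone only trades one problem for another.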
    

    With your example:

    $dom = new DOMDocument();
    $dom->loadXML(<<<XML
    <html>
        <textarea id="something"></textarea>
        <div id="someDiv" class="whaever"></div>
    </html>
    XML
    );
    

    Calling echo export_html($dom); will produce:

    <?xml version="1.0"?>
    <html>
        <textarea id="something"></textarea>
        <div id="someDiv" class="whaever"></div>
    </html>
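
The same idea can be tried on the namespaced div from the question's edit. The sketch below is a variant of the function above, not a copy of it: it compares localName instead of nodeName (an assumption of this sketch, so that a prefixed name such as default:br would still be recognised as void), and it serializes only the root element so that no XML declaration is emitted for text/html. The function name export_xhtml_fragment is hypothetical, and the void list leaves out obsolete elements such as param, command and keygen for brevity.

```php
<?php
// Hypothetical variant of export_html(): prefix-tolerant, no XML declaration.
function export_xhtml_fragment(DOMDocument $dom): string
{
    // Void elements (obsolete ones omitted for brevity).
    $voids = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
              'link', 'meta', 'source', 'track', 'wbr'];

    // '*' matches elements in any namespace, so no namespace registration is needed here.
    foreach ((new DOMXPath($dom))->query('//*[not(node())]') as $node) {
        // localName ignores any prefix, e.g. "default:br" is compared as "br".
        if (!in_array($node->localName, $voids, true)) {
            // Placeholder child forces an explicit closing tag on serialization.
            $node->appendChild($dom->createComment('NOT_VOID'));
        }
    }

    // Serialize the root element only, then strip the placeholders.
    return str_replace('<!--NOT_VOID-->', '', $dom->saveXML($dom->documentElement));
}

$dom = new DOMDocument();
$dom->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    . '<div id="videoPlayer0" class="videoPlayerPlacement" data-xml="video/cp_IV_a_1.xml"/>'
    . '<br/></body></html>'
);

$out = export_xhtml_fragment($dom);
echo $out, PHP_EOL;
```

With this input the div should come back with an explicit </div> while <br/> stays self-closing. This does not address the xmlns:default prefix issue itself; it only keeps the void check working if such prefixes are present.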