Why does PHP's DOMDocument remove leading whitespace from Processing Instruction nodes? (<?php ?>)


I am loading an XML-compliant PHP file into DOMDocument.

    $domDoc = new DOMDocument();
    $domDoc->recover            = TRUE;
    $domDoc->preserveWhiteSpace = TRUE;
    $domDoc->formatOutput       = FALSE;
    $domDoc->substituteEntities = FALSE;
    $domDoc->resolveExternals   = FALSE;

Despite preserving whitespace and instructing it to not format the output, I am still finding the leading whitespace in <?php ?> blocks removed when I save the XML with $domDoc->saveXML().

Input:

<?xml version="1.0" encoding="UTF-8"?>
<html>
<?php

// This is code.

// Something else.
    echo 'test';

?>
</html>

Output:

<?xml version="1.0" encoding="UTF-8"?>
<html>
<?php // This is code.

// Something else.
    echo 'test';

?>
</html>

I want the output to be as close to the input as possible. Collapsing whitespace between attributes is acceptable, but collapsing whitespace between nodes or within a Processing Instruction is not okay. Why is PHP's DOMDocument / libxml2 changing the contents of the PI? Will I need to resort to manual DOM echoing to keep the whitespace completely preserved?


Solution

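  • Dropping the leading white space in a PI node is actually correct behavior: that white space is the separator between the target and the data, not part of the data itself. The DOM defines the data portion of a processing instruction as: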

    The content of this processing instruction. This is from the first non white space character after the target to the character immediately preceding the ?>.

    (Emphasis mine: note "first non white space character".)

    The preserveWhiteSpace setting only affects text nodes (it controls whether whitespace-only text nodes are kept), which is why it doesn't help you here.

    In any case, I would advise against relying on embedded PHP being treated as a processing instruction: PHP code can contain ?> (e.g. as part of a string literal), which would terminate the processing instruction early.
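
Both points can be checked with a small script. This is only a sketch: the sample XML string and the variable names are made up for illustration and are not taken from the question. The first PI has blank lines and indentation after the target; the second contains the PI close delimiter inside a PHP string literal.

```php
<?php
// Hypothetical sample input: two PIs inside a root element.
$xml = "<root><?php\n\n   echo 'a';\n?><?php echo '?>'; ?></root>";

$doc = new DOMDocument();
$doc->preserveWhiteSpace = true;   // has no effect on PI data
$doc->loadXML($xml);

$root   = $doc->documentElement;
$first  = $root->childNodes->item(0);  // DOMProcessingInstruction
$second = $root->childNodes->item(1);  // DOMProcessingInstruction
$rest   = $root->childNodes->item(2);  // DOMText: leftover of the second block

// Leading white space after the target is gone; trailing white space stays.
var_dump($first->target);   // string(3) "php"
var_dump($first->data);     // string(10) "echo 'a';\n" (with a real newline)

// The second PI ended at the delimiter inside the string literal.
var_dump($second->data);    // string(6) "echo '"

// Everything after that point was parsed as plain character data.
var_dump($rest->nodeValue);
```

So the parser never sees the leading white space as part of the PI's data, and the second block is silently split into a truncated PI plus a text node, which is why round-tripping PHP source this way is fragile.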