How to extract html markup within an XML node with XPath


I'm using DOMDocument and XPath.

Given the following XML:

<Description>
    <CompleteText>
        <DetailTxt>
            <Text>
                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br/>
                <span>Normal position</span>
                <br/>
                <span> </span>
                <br/>
            </Text>
        </DetailTxt>            
    </CompleteText>
</Description>

The node /Description/CompleteText/DetailTxt/Text contains markup that is unfortunately unescaped, and I can't change that. Is there a way to query that content while preserving the HTML markup?

What I tried

I tried nodeValue and also textContent. Both give me the text content with the markup stripped.


Solution

  • You can use the saveHTML method of DOMDocument to serialize a node as HTML. In your case you seem to want to call it on each child node of the selected node and concatenate the resulting strings. In the browser DOM APIs that is called innerHTML, so I have written a function of that name that does this. The following snippet also uses the ability to call PHP functions from XPath:

    <?php
    $xml = <<<'EOD'
    <Description>
        <CompleteText>
            <DetailTxt>
                <Text>
                    <span>Here there is some text</span>
                    <h2>And maybe a headline</h2>
                    <br/>
                    <span>Normal position</span>
                    <br/>
                    <span> </span>
                    <br/>
                </Text>
            </DetailTxt>            
        </CompleteText>
    </Description>  
    EOD;
    
    $doc = new DOMDocument();
    
    $doc->loadXML($xml);
    
    $xpath = new DOMXPath($doc);
    
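    // When called from XPath, a node-set argument arrives as an array of DOM nodes.
    function innerHTML($nodeList) {
      // Return an empty string if the path matched nothing.
      if (empty($nodeList)) {
        return '';
      }
      $node = $nodeList[0];
      $html = '';
      $containingDoc = $node->ownerDocument;
      // Serialize each child node as HTML and concatenate the results.
      foreach ($node->childNodes as $child) {
        $html .= $containingDoc->saveHTML($child);
      }
      return $html;
    }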
    
    // Bind the php: prefix and allow XPath to call the innerHTML function.
    $xpath->registerNamespace("php", "http://php.net/xpath");
    $xpath->registerPHPFunctions("innerHTML");
    
    $innerHTML = $xpath->evaluate('php:function("innerHTML", /Description/CompleteText/DetailTxt/Text)');
    
    echo $innerHTML;
    

    The output, as run at http://sandbox.onlinephpfunctions.com/code/62a980e2d2a2485c2648e16fc647a6bd6ff5620b, is as follows (note that saveHTML writes the empty br elements as <br>):

                <span>Here there is some text</span>
                <h2>And maybe a headline</h2>
                <br>
                <span>Normal position</span>
                <br>
                <span> </span>
                <br>
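
If calling a PHP function from inside XPath is not needed, a simpler variant is to select the node with DOMXPath::query and serialize its children directly. The sketch below is only an illustration under a few assumptions: the helper name innerXml is hypothetical, the input is a shortened copy of the XML from the question, and saveXML is swapped in for saveHTML so that empty elements should keep their XML form (<br/>) rather than being rewritten as <br>.

```php
<?php
// Hypothetical helper (not part of DOMDocument): concatenates the
// XML serialization of every child of the given node.
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes only this child, using XML syntax.
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$doc = new DOMDocument();
// Shortened sample input, written on one line so no whitespace text nodes appear.
$doc->loadXML(
    '<Description><CompleteText><DetailTxt><Text>'
    . '<span>Here there is some text</span><h2>And maybe a headline</h2><br/>'
    . '</Text></DetailTxt></CompleteText></Description>'
);

$xpath = new DOMXPath($doc);
// query() returns a DOMNodeList; item(0) is null when nothing matches.
$text = $xpath->query('/Description/CompleteText/DetailTxt/Text')->item(0);

if ($text !== null) {
    echo innerXml($text), "\n";
}
```

Use saveHTML in the helper instead if HTML serialization, as in the snippet above, is what you want.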