
Parsing DITA / XML files with nodes embedded in text in PHP


I'm trying to parse a .dita file in which one node sits inside another. That by itself isn't unusual, but the outer node also contains text surrounding the inner node. It looks a bit like this:

<node>
    Hello this is a <xlink src="example.com">LINK</xlink> that you may click
</node>

I can get the text from node and I can get all instances of xlink, yet the text from the node looks like this:

Hello this is a  that you may click

As you can see, the word LINK is missing. I can query the xlink nodes and get an array containing the word LINK, but so far I haven't been able to put the words back, because their position in the text is unknown.

I should add that checking for two consecutive spaces wouldn't work: the original text can also contain double spaces, so the words could end up in the wrong position.


Solution

  • The DOMElement::$textContent property (inherited from DOMNode) contains the text content of the element and all of its descendant nodes.

    If you fetch values via an XPath expression, you can use the string() function to cast the first node of a node set to a string, which returns its text content.

    $xml = <<<'XML'
    <node>
        Hello this is a <xlink src="example.com">LINK</xlink> that you may click
    </node>
    XML;
    
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    
    // access the text content of the node element
    var_dump($document->documentElement->textContent);
    
    // use the XPath string() function
    var_dump($xpath->evaluate('string(self::node)', $document->documentElement));
    

    Output:

    string(45) "
        Hello this is a LINK that you may click
    "
    string(45) "
        Hello this is a LINK that you may click
    "
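
If the surrounding whitespace is unwanted, or the link target has to stay attached to its position in the sentence, the same idea can be extended. The following is only a sketch under assumptions: the element and attribute names (node, xlink, src) come from the sample in the question, and the "text [src]" output format is made up for illustration. It uses the XPath normalize-space() function for the plain text and walks the child nodes in document order for the annotated text.

```php
<?php
$xml = <<<'XML'
<node>
    Hello this is a <xlink src="example.com">LINK</xlink> that you may click
</node>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// normalize-space() trims both ends and collapses whitespace runs to one space
$plain = $xpath->evaluate('normalize-space(.)', $document->documentElement);

// Walk the direct children in document order so every xlink keeps its position
$parts = [];
foreach ($document->documentElement->childNodes as $child) {
    if ($child instanceof DOMElement && $child->localName === 'xlink') {
        // hypothetical format: link text followed by its src in brackets
        $parts[] = sprintf('%s [%s]', $child->textContent, $child->getAttribute('src'));
    } elseif ($child instanceof DOMText || $child instanceof DOMElement) {
        // text nodes and other elements contribute their text unchanged;
        // comments and processing instructions are skipped
        $parts[] = $child->textContent;
    }
}

// Collapse whitespace in PHP, mirroring what normalize-space() did above
$withLinks = trim(preg_replace('/\s+/', ' ', implode('', $parts)));

var_dump($plain, $withLinks);
```

With the sample input, $plain holds "Hello this is a LINK that you may click" and $withLinks holds "Hello this is a LINK [example.com] that you may click". The loop only looks at direct children, so nested inline elements inside an xlink would need a recursive variant.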