SimpleXML get Element Content between Child Elements


I am parsing XML in PHP with SimpleXML and have an XML document like this:

<xml>
    <element>
        textpart1
            <subelement>subcontent1</subelement>
        textpart2
            <subelement>subcontent2</subelement>
        textpart3
    </element>
</xml>

When I do $xml->element it naturally gives me the whole element, as in all three textparts.

So if I parse this into an array (with a foreach for the children) I get:

0 => textpart1textpart2textpart3, 1 => subcontent1, 2 => subcontent2

I need a way to parse the <element> node so that each textpart that ends at, or begins after, a subelement is treated as its own element.

As a result I am looking for an ordered list that could be expressed as an array like this:

0 => textpart1, 1 => subcontent1, 2 => textpart2, 3 => subcontent2, 4 => textpart3

Is that possible without altering the XML file? Thanks in advance for any hints!


Solution

  • As others have said, SimpleXML doesn't have any support for accessing individual text nodes as separate entities, so you will need to supplement it with some DOM methods. Thankfully, you can switch between the two at will using dom_import_simplexml and simplexml_import_dom.
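
    As a rough illustration of switching back and forth, the sketch below converts a SimpleXML object to DOM and one of its child nodes back again. The sample markup and variable names are assumptions made up for this example.

    ```php
    <?php
    // Sample markup is an assumption, modelled on the question's XML.
    $sx = simplexml_load_string('<element>text<subelement>subcontent</subelement></element>');

    // SimpleXML -> DOM: the DOM view exposes the text node as a child in its own right.
    $dom = dom_import_simplexml($sx);
    echo get_class($dom), "\n";          // DOMElement
    echo $dom->childNodes->length, "\n"; // 2 (one text node, one element)

    // DOM -> SimpleXML: convert a single child element back.
    $back = simplexml_import_dom($dom->lastChild);
    echo get_class($back), "\n"; // SimpleXMLElement
    echo (string) $back, "\n";   // subcontent
    ```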

    The key pieces of DOM functionality you need are:

    • the DOMNode->childNodes property, which lists all nodes directly under a particular element (text nodes included) as an iterable DOMNodeList
    • the DOMNode->nodeType property, for determining whether a particular child is a text node (XML_TEXT_NODE) or an element (XML_ELEMENT_NODE)
    • the DOMNode->nodeValue property, to get the actual text
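
    To see these three properties side by side, here is a minimal sketch that walks the direct children of one element and labels each one. The sample markup and the $lines variable are assumptions for illustration only.

    ```php
    <?php
    // Sample markup is an assumption, modelled on the question's XML.
    $sx_element = simplexml_load_string(
        '<element>textpart1<subelement>subcontent1</subelement>textpart2</element>'
    );

    $lines = array();
    foreach ( dom_import_simplexml($sx_element)->childNodes as $dom_child )
    {
        // nodeType is an integer constant; only two kinds occur in this sample.
        $kind = ( $dom_child->nodeType === XML_TEXT_NODE ) ? 'text' : 'element';
        // For an element, nodeValue is the concatenated text of its descendants.
        $lines[] = $kind . ': ' . $dom_child->nodeValue;
    }

    echo implode("\n", $lines), "\n";
    // text: textpart1
    // element: subcontent1
    // text: textpart2
    ```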

    Given those, you can write a function which returns an array with a mixture of SimpleXML objects for child elements, and strings for child text nodes, something like this:

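    /**
     * Returns the direct children of a SimpleXML element in document order:
     * strings for text nodes, SimpleXMLElement objects for child elements.
     */
    function get_child_elements_and_text_nodes($sx_element)
    {
        $return = array();

        // Switch to DOM, because SimpleXML does not expose individual text nodes.
        $dom_element = dom_import_simplexml($sx_element);
        foreach ( $dom_element->childNodes as $dom_child )
        {
            switch ( $dom_child->nodeType )
            {
                case XML_TEXT_NODE:
                    // Includes whitespace-only nodes between elements.
                    $return[] = $dom_child->nodeValue;
                    break;
                case XML_ELEMENT_NODE:
                    // Convert back so callers can keep using SimpleXML on child elements.
                    $return[] = simplexml_import_dom($dom_child);
                    break;
            }
        }

        return $return;
    }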
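
    Applied to the XML from the question, the same idea can produce the flat ordered list that was asked for. The following is only a sketch: ordered_text_and_element_contents is a hypothetical helper, not part of SimpleXML or DOM, and it assumes that surrounding whitespace should be trimmed and whitespace-only text nodes skipped.

    ```php
    <?php
    // Hypothetical helper: flattens one element's direct children into an
    // ordered list of trimmed strings (text nodes and element contents alike).
    function ordered_text_and_element_contents(SimpleXMLElement $sx_element): array
    {
        $result = array();

        foreach ( dom_import_simplexml($sx_element)->childNodes as $dom_child )
        {
            if ( $dom_child->nodeType === XML_TEXT_NODE )
            {
                // Assumption: indentation around the text is not wanted.
                $text = trim($dom_child->nodeValue);
                if ( $text !== '' )
                {
                    $result[] = $text;
                }
            }
            elseif ( $dom_child->nodeType === XML_ELEMENT_NODE )
            {
                $result[] = trim($dom_child->textContent);
            }
        }

        return $result;
    }

    $source = '<xml>
        <element>
            textpart1
                <subelement>subcontent1</subelement>
            textpart2
                <subelement>subcontent2</subelement>
            textpart3
        </element>
    </xml>';

    $xml = simplexml_load_string($source);
    $parts = ordered_text_and_element_contents($xml->element);

    print_r($parts);
    // $parts is array('textpart1', 'subcontent1', 'textpart2', 'subcontent2', 'textpart3')
    ```

    Without the trim() calls, each text entry would keep the line breaks and indentation that surround it in the file.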
    

    In your case, you need to recurse down the tree. Mixing DOM and SimpleXML as you go makes that a little confusing, so you could instead write the recursion entirely in DOM and convert the SimpleXML object once, before running it:

    /**
     * Collects the values of all text nodes under a DOM element, at any depth,
     * in document order.
     */
    function recursively_find_text_nodes($dom_element)
    {
        $return = array();

        foreach ( $dom_element->childNodes as $dom_child )
        {
            switch ( $dom_child->nodeType )
            {
                case XML_TEXT_NODE:
                    $return[] = $dom_child->nodeValue;
                    break;
                case XML_ELEMENT_NODE:
                    // Descend into the child element and append its text nodes.
                    $return = array_merge($return, recursively_find_text_nodes($dom_child));
                    break;
            }
        }

        return $return;
    }
    
    $text_nodes = recursively_find_text_nodes(dom_import_simplexml($simplexml->element));
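
    The raw result keeps every text node, including ones that contain only whitespace. As a sketch of one way to tidy it, the block below restates the recursive function in compact form so it runs on its own, then trims and filters the result. The sample document (the question's XML with one extra level of nesting) and the $clean variable are assumptions for illustration.

    ```php
    <?php
    // Compact restatement of the recursive function, so this sketch is self-contained.
    function recursively_find_text_nodes($dom_element)
    {
        $return = array();
        foreach ( $dom_element->childNodes as $dom_child )
        {
            if ( $dom_child->nodeType === XML_TEXT_NODE )
            {
                $return[] = $dom_child->nodeValue;
            }
            elseif ( $dom_child->nodeType === XML_ELEMENT_NODE )
            {
                $return = array_merge($return, recursively_find_text_nodes($dom_child));
            }
        }
        return $return;
    }

    // Assumed sample input with a nested <b> element.
    $simplexml = simplexml_load_string(
        '<xml><element> textpart1 <subelement>sub <b>deep</b> content</subelement> textpart2 </element></xml>'
    );

    $text_nodes = recursively_find_text_nodes(dom_import_simplexml($simplexml->element));

    // Trim each value, drop the ones that were whitespace only, and reindex.
    $clean = array_values(array_filter(array_map('trim', $text_nodes), 'strlen'));

    print_r($clean);
    // $clean is array('textpart1', 'sub', 'deep', 'content', 'textpart2')
    ```

    Filtering with 'strlen' rather than the default array_filter behaviour keeps a text node whose content is the string "0".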
    
