SimpleXML: parsing HTML leaves out nested elements inside an element with a text node


I'm trying to parse a specific HTML document, a kind of dictionary with about 10,000 words and descriptions. It went well until I noticed that entries in one specific format don't get parsed correctly.

Here is an example:

    <?php
    $html = '
        <p>
            <b>
                <span>zot; zotz </span>
            </b>
            <span>Nista; nula. Isto
                <b>zilch; zip.</b>
            </span>
        </p>
        ';

    $xml = simplexml_load_string($html);

    var_dump($xml);
    ?>

Result of var_dump() is:

    object(SimpleXMLElement)#1 (2) {
      ["b"]=>
      object(SimpleXMLElement)#2 (1) {
        ["span"]=>
        string(10) "zot; zotz "
      }
      ["span"]=>
      string(39) "Nista; nula. Isto

            "
    }

As you can see, SimpleXML kept the text node inside the second <span> tag but left out its child <b> element and the text inside it.

I've also tried:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xml = simplexml_import_dom($doc);

with the same result.

Since this looked like a common problem when parsing HTML, I tried googling it, but the only place that acknowledges the problem is this blog post: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/ and it does not offer a solution.

The posts and answers about parsing HTML on SO are just too general.

Is there a simple way of dealing with this? Or should I change my strategy?


Solution

  • Your observation is correct: here SimpleXML only offers the child element-node, not the child text-nodes. The solution is to switch to DOMDocument, which can access all child nodes, both text and element.

    // first span element
    $span = dom_import_simplexml($xml->span);
    
    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue );
    }
    

    This example uses dom_import_simplexml on the more specific <span> element-node; the traversal is then done over the children of the corresponding DOMElement object.

    The output:

     - DOMText : Nista; nula. Isto
    
     - DOMElement : zilch; zip.
     - DOMText : 
    

    The first entry is the first text-node within the <span> element. It is followed by the <b> element (which again contains some text) and then by another text-node that consists of whitespace only.
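
    If the whitespace-only text-nodes are just noise for your dictionary entries, they can be filtered out during the traversal. The following is only a sketch of that idea: describeChildren is a hypothetical helper name, and the one-line markup is a condensed stand-in for the entry above.

    ```php
    <?php
    // Hypothetical helper (not part of the DOM API): list the direct children
    // of a node, labelling text and element nodes and dropping text nodes
    // that contain only whitespace.
    function describeChildren(DOMNode $node): array
    {
        $parts = [];
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text = trim($child->nodeValue);
                if ($text !== '') {
                    $parts[] = ['text', $text];
                }
            } elseif ($child->nodeType === XML_ELEMENT_NODE) {
                // textContent collects the text of the element and its descendants
                $parts[] = [$child->nodeName, trim($child->textContent)];
            }
        }
        return $parts;
    }

    // Condensed version of the entry from the question (assumed equivalent).
    $xml  = simplexml_load_string(
        '<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto <b>zilch; zip.</b> </span></p>'
    );
    $span = dom_import_simplexml($xml->span);

    foreach (describeChildren($span) as [$kind, $text]) {
        printf(" - %s : %s\n", $kind, $text);
    }
    ```

    Checking nodeType against XML_TEXT_NODE and XML_ELEMENT_NODE instead of calling get_class keeps the branching explicit; other node types, such as comments, are simply skipped in this sketch.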

    The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document, as in the case you face here.

    The example in full:

    $html = <<<HTML
    <p>
        <b>
            <span>zot; zotz </span>
        </b>
        <span>Nista; nula. Isto
            <b>zilch; zip.</b>
        </span>
    </p>
    HTML;
    
    $xml = simplexml_load_string($html);
    
    // first span element
    $span = dom_import_simplexml($xml->span);
    
    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue );
    }
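
As a side note to the var_dump() output in the question: the nested <b> element is not lost by the parser, it is only missing from the dump. A minimal sketch, assuming the same condensed one-line markup as a stand-in for the entry, shows what SimpleXML itself still exposes:

```php
<?php
// Condensed version of the entry from the question (assumed equivalent).
$xml = simplexml_load_string(
    '<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto <b>zilch; zip.</b></span></p>'
);

// The nested <b> is reachable by name even though var_dump() does not list it.
$nested = (string) $xml->span->b;

// asXML() serialises the element together with all of its children.
$markup = $xml->span->asXML();

// Casting to string returns only the text placed directly inside <span>,
// not the text of the nested <b>.
$ownText = (string) $xml->span;

echo $nested, "\n";
echo $markup, "\n";
echo $ownText, "\n";
```

This is enough when you only need the element children by name. The order in which text and elements alternate inside <span> is what requires the DOM traversal shown in the answer.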