
How to remove the outer tags from a node in PHP


I have the following HTML code:

$pageHTML = '<html>
<head></head>
<body>
<div class="some class">
<header>Header</header>
<section>Section</section>
<footer>Footer</footer>
</div>
</body>
</html>';

and I need to remove the outer tags of the <div> while keeping all of its inner HTML inside the <body>.

If I try

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);

$bodyDivs = [];
foreach($dom->getElementsByTagName('body')[0]->childNodes as $bodyChild) {
    if($bodyChild->nodeName == 'div') {
        $bodyDivs[] = $bodyChild;
    }
}

if(count($bodyDivs) == 1) {
    foreach($bodyDivs[0]->childNodes as $divChild) {
        $dom->getElementsByTagName('body')[0]->appendChild($divChild);
    }
    $dom->getElementsByTagName('body')[0]->removeChild($bodyDivs[0]);
}

the div is removed, but its children are not appended to <body> before it is removed.

If I try a reverse loop like

$k = count($bodyDivs[0]->childNodes);
for($n = $k-1; $n >= 0; $n--) {
    $dom->getElementsByTagName('body')[0]->appendChild($bodyDivs[0]->childNodes[$n]);
}
$dom->getElementsByTagName('body')[0]->removeChild($bodyDivs[0]);

the children are added to the body, but in reverse order.

So I get

<body>
<footer>Footer</footer>
<section>Section</section>
<header>Header</header>
</body>

but I need

<body>
<header>Header</header>
<section>Section</section>
<footer>Footer</footer>
</body>

How can I resolve this?


Solution

  • Your original code is very close; it is just missing one key point.

    Original code

    foreach($bodyDivs[0]->childNodes as $divChild) {
        $dom->getElementsByTagName('body')[0]->appendChild($divChild);
    }
    

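    Using foreach over a list of nodes while also removing nodes from that same list (in your case, moving them to the <body>) does not behave as you intended. The childNodes list is live, so it changes as soon as a node is moved out of it, and the loop ends early. In your HTML the first child of the <div> is most likely the whitespace text node before <header>, which would explain why nothing visible ends up in the <body>.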

    Simplified, complete example for demonstration purposes:

    <?php
    $doc = new DOMDocument;
    $doc->loadXML('<example><a/><b/><c/><d/><e/></example>');
    $parent = $doc->documentElement;
    foreach ($parent->childNodes as $child) {
        $parent->removeChild($child);
    }
    echo $doc->saveXML();
    

    This outputs the following:

    <?xml version="1.0"?>
    <example><b/><c/><d/><e/></example>
    

    Only <a/> was removed; the loop stopped after the first node. Totally sensible, right?! Fear not, we can do better.

    What to do?

    A common approach that does behave as intended is to loop until the list is empty, always taking the first remaining node.

    <?php
    $doc = new DOMDocument;
    $doc->loadXML('<example><a/><b/><c/><d/><e/></example>');
    $parent = $doc->documentElement;
    while ($parent->childNodes->length > 0) {
        $child = $parent->childNodes->item(0);
        $parent->removeChild($child);
    }
    echo $doc->saveXML();
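
Another option, not part of the answer above, is to copy the live list into a plain PHP array before looping. The sketch below assumes iterator_to_array() is acceptable for your use case; the variable names $children and $remaining are introduced here purely for illustration.

```php
<?php
$doc = new DOMDocument;
$doc->loadXML('<example><a/><b/><c/><d/><e/></example>');
$parent = $doc->documentElement;

// Snapshot the live DOMNodeList into a plain array. The array does not
// change when nodes are removed from the document, so foreach is safe.
$children = iterator_to_array($parent->childNodes);
foreach ($children as $child) {
    $parent->removeChild($child);
}

// Number of children still attached to <example> after the loop.
$remaining = $parent->childNodes->length;
echo $doc->saveXML();
```

The trade-off is one extra array holding references to the nodes, in exchange for being able to keep an ordinary foreach.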
    

    Applied to your code

    All of the above means that your original foreach:

    foreach($bodyDivs[0]->childNodes as $divChild) {
        $dom->getElementsByTagName('body')[0]->appendChild($divChild);
    }
    

    can be replaced with a while loop:

    while ($bodyDivs[0]->childNodes->length > 0) {
        $divChild = $bodyDivs[0]->childNodes->item(0);
        $dom->getElementsByTagName('body')->item(0)->appendChild($divChild);
    }
    

    Aside: I used the ->item(0) notation above instead of the array-style [0], as that's more conventional; both refer to the same node.
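
Putting the pieces together, here is a minimal end-to-end sketch using the HTML from the question. It is a variation rather than the exact code above: it assumes you want the children to stay where the <div> was, so it uses insertBefore() and firstChild instead of appendChild() and item(0). The names $body, $div and $bodyHTML are introduced here for illustration.

```php
<?php
$pageHTML = '<html>
<head></head>
<body>
<div class="some class">
<header>Header</header>
<section>Section</section>
<footer>Footer</footer>
</div>
</body>
</html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($pageHTML);
libxml_clear_errors();
libxml_use_internal_errors(false);

$body = $dom->getElementsByTagName('body')->item(0);

// Collect the direct <div> children of <body>. Nothing is moved in this
// loop, so iterating the live list with foreach is fine here.
$bodyDivs = [];
foreach ($body->childNodes as $bodyChild) {
    if ($bodyChild->nodeName === 'div') {
        $bodyDivs[] = $bodyChild;
    }
}

if (count($bodyDivs) === 1) {
    $div = $bodyDivs[0];
    // Move the first remaining child in front of the <div> until the
    // <div> is empty, then drop the empty <div> itself.
    while ($div->firstChild !== null) {
        $body->insertBefore($div->firstChild, $div);
    }
    $body->removeChild($div);
}

// Serialize only the <body> element for inspection.
$bodyHTML = $dom->saveHTML($body);
echo $bodyHTML;
```

With appendChild() the moved nodes would land at the end of the <body>, which gives the same result for this input but not when the <body> has other content after the <div>.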