php html domdocument

Save multiple HTML bodies as one using DOMDocument


I have a string containing multiple <html><body><div>Content</div></body></html> documents. I want to extract the contents of each one and join them into a single valid structure. For example:

<html><body><div>Content</div></body></html>
<html><body><div>Content</div></body></html>
<html><body><div>Content</div></body></html>

should become:

<html>
    <body>
        <div>Content</div>
        <div>Content</div>
        <div>Content</div>
    </body>
</html>

My current code looks like this:

    libxml_use_internal_errors(true);
    $newDom = new DOMDocument();

    $newBody = "";

    $newDom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    $bodyTags = $newDom->getElementsByTagName("body");

    foreach($bodyTags as $body) {
        $newBody .= $newDom->saveHTML($body);
    }

$newBody now contains every <body> element, including the <body> tags themselves:

<body><div>Content</div></body>
<body><div>Content</div></body>
<body><div>Content</div></body>

How can I save only the inner HTML of each <body> tag in $newBody?

Edit:

Based on @NigelRen's answer, this is my solution:

    libxml_use_internal_errors(true);
    $newDom = new DOMDocument();

    $newBody = '';
    $newDom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    $bodyTags = $newDom->getElementsByTagName("body");

    foreach($bodyTags as $body) {
        foreach ($body->childNodes as $node)   {
            $newBody .= $newDom->saveHTML($node);
        }
    }

    $newDom = new DOMDocument();
    $newDom->loadHTML(mb_convert_encoding($newBody, 'HTML-ENTITIES', 'UTF-8'));
    $newBody = $newDom->saveHTML();
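
The inner loop above could be factored into a small helper that returns only the children of an element. This is a minimal sketch, not part of DOMDocument itself: the function name innerHtml() is hypothetical, the sample input is a single ASCII-only document, and encoding handling is left out for brevity.

```php
<?php
// Hypothetical helper (not a DOMDocument method): serializes only the
// children of $element, i.e. its inner HTML, without the element's own tags.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<html><body class="page"><div>Content</div></body></html>');

// Take the first <body> and print what is inside it.
$body = $doc->getElementsByTagName('body')->item(0);
$inner = innerHtml($body);
echo $inner, PHP_EOL;
```

Calling the helper once per <body> and concatenating the results should be equivalent to the nested loops above.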

Solution

  • It's awkward because loadHTML() will attempt to fix the HTML in your original document, which creates a structure that may not be what you expect.

    However, if your input follows a basic outline like the one below, the following will copy the contents of the <body> tags into a new document (comments in code)...

    $html = '<html><body><div>Content1</div></body></html>
    <html><body><div>Content2</div></body></html>
    <html><body><div>Content3</div></body></html>';
    
    libxml_use_internal_errors(true);
    $newDom = new DOMDocument();
    
    // New document with final code
    $newBody = new DOMDocument();
    
    $newDom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    
    // Set up a basic template for the new document
    $newBody->loadHTML("<html><body /></html>");
    // Find where to add any new content
    $addBody = $newBody->getElementsByTagName("body")[0];
    // Find the existing content to add
    $bodyTags = $newDom->getElementsByTagName("body");
    foreach($bodyTags as $body) {
        // Add all of the contents of the <body> tag into the new document
        foreach ( $body->childNodes as $node )   {
            // Import the node to copy to the new document and add it in
            $addBody->appendChild($newBody->importNode($node, true));
        }
    }
    echo $newBody->saveHTML();
    

    which gives...

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><div>Content1</div><div>Content2</div><div>Content3</div></body></html>
    

    The limitations are that any content outside the <body> tags and any attributes of the <body> tags are not preserved.
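
If the individual documents happen to be available as separate strings (an assumption, since the question starts from one concatenated string), a variant that parses each document on its own avoids relying on how loadHTML() repairs multi-document input, and it can carry over the <body> attributes as well. The sketch below is only an illustration: mergeHtmlDocuments() is a hypothetical name, the sample input is ASCII-only so no encoding conversion is shown, and content outside <body> is still dropped.

```php
<?php
// Hypothetical helper: merges several complete HTML documents into one.
// Assumes each array entry is a separate document string.
function mergeHtmlDocuments(array $documents): string
{
    libxml_use_internal_errors(true);

    // Basic template that receives the merged content.
    $target = new DOMDocument();
    $target->loadHTML('<html><body></body></html>');
    $targetBody = $target->getElementsByTagName('body')->item(0);

    foreach ($documents as $html) {
        if (trim($html) === '') {
            continue; // loadHTML() rejects empty input
        }

        $source = new DOMDocument();
        $source->loadHTML($html);

        $body = $source->getElementsByTagName('body')->item(0);
        if ($body === null) {
            continue; // nothing to copy from this document
        }

        // Carry over <body> attributes; later documents win on conflicts.
        foreach ($body->attributes as $attr) {
            $targetBody->setAttribute($attr->nodeName, $attr->nodeValue);
        }

        // importNode() creates a deep copy owned by $target, which is
        // required before a node from another document can be appended.
        foreach ($body->childNodes as $node) {
            $targetBody->appendChild($target->importNode($node, true));
        }
    }

    libxml_clear_errors();
    return $target->saveHTML();
}

$merged = mergeHtmlDocuments([
    '<html><body class="page"><div>Content1</div></body></html>',
    '<html><body><div>Content2</div></body></html>',
    '<html><body><div>Content3</div></body></html>',
]);
echo $merged;
```

Attribute conflicts are resolved here by letting the last document win, which is one arbitrary choice among several; merging class lists, for example, would need extra handling.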