How to get all elements inside body with PHP DomDocument


I'm trying to parse an HTML string that may contain any valid HTML tags. I used this code to parse the string:

$doc = new DOMDocument();
$doc->loadHTML($product['description']); // comes from db
$els = $doc->getElementsByTagName('*');
foreach ($els as $node) {
    o($node->nodeName.' '.$node->nodeValue);
}

This does print my tags but the first two tags are html and body. I want to ignore those. The string from the db does not contain html or body tags. Here's an example:

<p>This is a paragraph</p>
<ol>
    <li>This is a list</li>
</ol>

I was wondering if there's a way to iterate over the tags inside the body only. I tried these:

$els = $doc->getElementsByTagName('body *');

$body = $doc->getElementsByTagName('body');
$els = $body->getElementsByTagName('*');

Neither works. I have seen others use XPath, but that gives me headaches. Can it be done with DOMDocument?


Solution

  • DOMDocument::loadHTML() parses its input as a complete HTML document, so when the string is only a fragment, the parser adds the implied <html> and <body> elements around it. That is why those two tags show up first in your loop.

    Neither of your attempts can work as written. getElementsByTagName() takes a single tag name (or '*'), not a CSS-style selector such as body *. It also returns a DOMNodeList, and a DOMNodeList has no getElementsByTagName() method of its own. Pick the body element out of the list with item(0) first, then ask that element for its descendants:

    $doc = new DOMDocument();
    $doc->loadHTML($product['description']); // comes from db
    
    // Get the body element
    $body = $doc->getElementsByTagName('body')->item(0);
    
    // Check if the body element exists
    if ($body) {
        // Get every element nested inside the body, at any depth
        $els = $body->getElementsByTagName('*');
    
        foreach ($els as $node) {
            echo $node->nodeName . ' ' . $node->nodeValue . "\n";
        }
        }
    } else {
        echo "Body tag not found.";
    }
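
Note that getElementsByTagName('*') returns every descendant of the body, not only its direct children. If you only want the top-level elements, you can loop over childNodes instead. Below is a minimal sketch of both options, assuming the example fragment from the question. The helper name bodyElements() and its $directOnly flag are made up for illustration and are not part of DOMDocument.

```php
<?php
// Hypothetical helper (not part of DOMDocument): returns the tag names of the
// elements inside <body>, either all descendants or direct children only.
function bodyElements(string $html, bool $directOnly = false): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) returns null when the list is empty.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return [];
    }

    $names = [];
    if ($directOnly) {
        // childNodes can also contain text nodes (e.g. whitespace),
        // so keep element nodes only.
        foreach ($body->childNodes as $child) {
            if ($child instanceof DOMElement) {
                $names[] = $child->nodeName;
            }
        }
    } else {
        // '*' matches every descendant element, in document order.
        foreach ($body->getElementsByTagName('*') as $el) {
            $names[] = $el->nodeName;
        }
    }

    return $names;
}

$html = "<p>This is a paragraph</p>\n<ol>\n    <li>This is a list</li>\n</ol>";

print_r(bodyElements($html));       // all descendants: p, ol, li
print_r(bodyElements($html, true)); // direct children only: p, ol
```

With the fragment from the question, the first call includes the nested li, while the second stops at p and ol.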