I'm trying to parse an HTML string that may contain any valid HTML tags. I used this code to parse the string:
$doc = new DOMDocument();
$doc->loadHTML($product['description']); // comes from db
$els = $doc->getElementsByTagName('*');
foreach ($els as $node) {
    o($node->nodeName.' '.$node->nodeValue);
}
This does print my tags but the first two tags are html and body. I want to ignore those. The string from the db does not contain html or body tags. Here's an example:
<p>This is a paragraph</p>
<ol>
<li>This is a list</li>
</ol>
I was wondering if there's a way to iterate over the tags inside the body only. I tried these:
$els = $doc->getElementsByTagName('body *');
$body = $doc->getElementsByTagName('body');
$els = $body->getElementsByTagName('*');
Neither works. I have seen others use XPath, but that gives me headaches. Can it be done with DOMDocument?
When you use DOMDocument::loadHTML() in PHP, it automatically wraps the provided HTML fragment in <html> and <body> tags if they are not already present. This is because DOMDocument expects a complete HTML document structure.
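To see the wrapping for yourself, here is a minimal sketch. The $fragment string is a hypothetical stand-in for $product['description'], based on the example in the question. With that input, the root element should be html and its first child should be body, even though the fragment contains neither.

```php
<?php
// Hypothetical fragment standing in for $product['description'].
$fragment = '<p>This is a paragraph</p><ol><li>This is a list</li></ol>';

$doc = new DOMDocument();
$doc->loadHTML($fragment);

// The parser supplies the wrapper elements the fragment lacks.
$root = $doc->documentElement;
echo $root->nodeName, "\n";             // name of the root element
echo $root->firstChild->nodeName, "\n"; // name of the root's first child
```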
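getElementsByTagName() matches a tag name only, so it doesn't support CSS-style selectors like body *. It also returns a DOMNodeList, which has no getElementsByTagName() method of its own; that is why both of your attempts fail. You can work around this by taking the body element out of the list with item(0) and then getting the elements inside it: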
$doc = new DOMDocument();
$doc->loadHTML($product['description']); // comes from db

// Get the body element (item(0) returns null if there is none)
$body = $doc->getElementsByTagName('body')->item(0);

// Check that the body element exists
if ($body) {
    // Get every element nested inside the body, at any depth
    $els = $body->getElementsByTagName('*');
    foreach ($els as $node) {
        echo $node->nodeName . ' ' . $node->nodeValue . "\n";
    }
} else {
    echo "Body tag not found.";
}
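Note that getElementsByTagName('*') returns every element nested inside the body, so for the example in the question it would visit p, ol, and li. If you only want the top-level elements, one option is to loop over the body's childNodes and keep the element nodes. This is a minimal sketch under that assumption; $html is a hypothetical stand-in for $product['description'], and $topLevel is just an illustrative name.

```php
<?php
// Hypothetical stand-in for $product['description'].
$html = "<p>This is a paragraph</p>\n<ol>\n<li>This is a list</li>\n</ol>";

$doc = new DOMDocument();
$doc->loadHTML($html);

$body = $doc->getElementsByTagName('body')->item(0);

$topLevel = [];
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        // childNodes also contains text nodes (e.g. whitespace between tags),
        // so keep element nodes only.
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $topLevel[] = $child->nodeName;
        }
    }
}

echo implode(', ', $topLevel), "\n";
```

Here the li element is skipped because it is a child of ol, not of body.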