I want to access the P elements from the innermost DIVs. That is, the contents of DIVs that do not have a DIV child. Is this possible with getElementsByTagName?
$html = '<html>
<head>
    <title></title>
</head>
<body>
    <div class="">
        <div class="">
            <p> Content1 </p>
            <p> Content2 </p>
            <div class="">
                <p> Content3 </p>
                <p> Content4 </p>
            </div>
        </div>
        <p> Content5 </p>
        <h2> Header </h2>
        <div class=""><p><strong> Content6 </strong></p> </div>
    </div>
    <div class=""> <p> Content7 </p></div>
    <div class="">
        <p> Content8 </p>
        <p> Content9 </p>
        <div class="">
            <p> Content10 </p>
        </div>
        <span> blah.. </span>
    </div>
</body></html>';
The expected output is as follows:
Array
(
    [0] => Array
        (
            [0] => Content3
            [1] => Content4
        )

    [1] => Array
        (
            [0] => Content6
        )

    [2] => Array
        (
            [0] => Content7
        )

    [3] => Array
        (
            [0] => Content10
        )

)
Extending from my answer here, you will have to perform two additional steps:

1. Check whether a div is an innermost div, i.e. that it has no child divs.
2. Group the p tags by div node. For this you can make use of spl_object_id to match p nodes with the parent div node they belong to.

Snippet:
$ps = [];
$divs = []; // holds the matched div objects so their spl_object_id values are not reused
$doc = new DOMDocument('1.0', 'UTF-8');
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

foreach ($doc->getElementsByTagName('p') as $p) {
    // Walk up from the <p> until the nearest enclosing <div> is found.
    $curr_node = $p->parentNode;
    while ($curr_node instanceof DOMElement) {
        if ($curr_node->tagName == 'div') {
            if (isInnerMostChildDiv($curr_node)) {
                $id = spl_object_id($curr_node);
                $divs[$id] = $curr_node;
                // trim() drops the surrounding whitespace so the values match the expected output.
                $ps[$id][] = trim($p->nodeValue);
            }
            break;
        }
        $curr_node = $curr_node->parentNode;
    }
}

// Returns true if no descendant of $div_node is a <div>.
function isInnerMostChildDiv($div_node) {
    foreach ($div_node->childNodes as $c_node) {
        if (!($c_node instanceof DOMElement)) {
            continue; // text and comment nodes cannot contain a div
        }
        if ($c_node->tagName == 'div' || !isInnerMostChildDiv($c_node)) {
            return false;
        }
    }
    return true;
}

$ps = array_values($ps);
print_r($ps);
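
As a possible alternative to walking up from every p, the same two steps can be sketched with DOMXPath: the expression //div[not(.//div)] selects the divs that have no div descendant, and a relative .//p query collects the paragraphs of each. This is only a minimal sketch, not part of the snippet above. The function name innermostDivParagraphs and the small $sample document are made up for illustration, and it assumes ASCII input (no encoding handling).

```php
<?php
// Hypothetical helper (not from the snippet above): groups the text of <p>
// elements by the innermost <div> that contains them.
function innermostDivParagraphs(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $groups = [];
    // Step 1: select every <div> that has no <div> descendant.
    foreach ($xpath->query('//div[not(.//div)]') as $div) {
        $texts = [];
        // Step 2: collect the <p> descendants of this div, in document order.
        foreach ($xpath->query('.//p', $div) as $p) {
            $texts[] = trim($p->textContent);
        }
        // Skip innermost divs that contain no <p> at all.
        if ($texts !== []) {
            $groups[] = $texts;
        }
    }
    return $groups;
}

// Illustrative input only; a reduced version of the structure in the question.
$sample = '<html><body>
<div><p> A </p><div><p> B </p><p> C </p></div></div>
<div><p><strong> D </strong></p></div>
</body></html>';

print_r(innermostDivParagraphs($sample));
```

For this sample the outer div is skipped because it contains another div, so the result should hold two groups: B and C from the nested div, and D from the second div. Because the grouping follows the div nodes returned by the query, no object ids are needed in this variant.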