How to get the P values of nested DIVs with DOMDocument - PHP


I want to access the P elements of the innermost DIVs, that is, the contents of DIVs that do not contain another DIV. Is this possible with getElementsByTagName?

$html = '<html>
    <head>
        <title></title>
    </head>
    <body>
      <div class="">
            <div class="">
                   <p>  Content1  </p>
                   <p>  Content2  </p>
                
                        <div class="">
                               <p>  Content3  </p>
                               <p>  Content4  </p>
                        </div>
            </div>
    
          <p>  Content5  </p>
          <h2> Header </h2>
          <div class=""><p><strong> Content6 </strong></p> </div>
    
      </div>
    
        <div class=""> <p> Content7 </p></div>
        <div class="">
                       <p> Content8 </p>  
                       <p> Content9 </p> 
    
                       <div class="">
                              <p> Content10 </p>  
                       </div> 
              <span> blah.. </span>
        </div>
    </body></html>';

The expected output is as follows:

Array
(
    [0] => Array
        (
            [0] =>   Content3  
            [1] =>   Content4  
        )

    [1] => Array
        (
            [0] =>  Content6 
        )

    [2] => Array
        (
            [0] =>  Content7 
        )

    [3] => Array
        (
            [0] =>  Content10 
        )
)

Solution

  • Extending my answer here, you need two additional steps:

    • Check that the parent div at hand does not contain any other div, either directly or nested deeper.
    • Group the p tags by their parent div. You can use spl_object_id for this, so that p nodes belonging to the same div node end up under the same key.

    Snippet:

    $ps = [];
    $doc = new DOMDocument('1.0', 'UTF-8');
    // Note: the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2.
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    foreach($doc->getElementsByTagName('p') as $p){
        // Walk up from the p node until the nearest div ancestor is found.
        $curr_node = $p->parentNode;
        while($curr_node instanceof DOMElement){
            if($curr_node->tagName == 'div'){
                // Keep the p only if that div contains no other div.
                if(isInnerMostChildDiv($curr_node)){
                    $ps[spl_object_id($curr_node)][] = $p->nodeValue;
                }
                break;
            }
            $curr_node = $curr_node->parentNode;
        }
    }

    // Returns true if no div exists anywhere below $div_node.
    function isInnerMostChildDiv($div_node){
        foreach($div_node->childNodes as $c_node){
            // Text and comment nodes cannot contain a div.
            if(!($c_node instanceof DOMElement)) continue;
            if($c_node->tagName == 'div' || !isInnerMostChildDiv($c_node)){
                return false;
            }
        }
        return true;
    }

    // Drop the object-id keys so the result is a plain list of groups.
    $ps = array_values($ps);

    print_r($ps);
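
The same two steps (find divs with no div inside, then group their p nodes) can also be expressed with DOMXPath. This is only a sketch of an alternative, not the approach from the answer above. The function name innermostDivParagraphs and the sample markup are made up for illustration, and encoding handling is left out, so it assumes ASCII input. One possible reason to prefer it: the PHP manual notes that an object id from spl_object_id may be reused once the object is destroyed, whereas here each div is processed directly and no ids are needed as keys.

```php
<?php
// Hypothetical helper (not part of the original answer): groups the text of
// <p> elements by the innermost <div> that contains them.
function innermostDivParagraphs(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $groups = [];

    // A div counts as "innermost" when it has no div descendant.
    foreach ($xpath->query('//div[not(.//div)]') as $div) {
        $texts = [];
        // Collect every p below this div, in document order.
        foreach ($xpath->query('.//p', $div) as $p) {
            $texts[] = $p->nodeValue;
        }
        // Skip innermost divs that hold no p at all.
        if ($texts !== []) {
            $groups[] = $texts;
        }
    }
    return $groups;
}

// Made-up sample markup, smaller than the HTML in the question.
$sample = '<html><body>
  <div><div><p>A</p><div><p>B</p><p>C</p></div></div><p>D</p></div>
  <div><p><strong>E</strong></p></div>
  <div><span>no paragraphs</span></div>
</body></html>';

print_r(innermostDivParagraphs($sample));
```

For the sample markup this yields two groups, one with B and C and one with E. A and D are skipped because their nearest div still contains another div.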