php dom domdocument getelementsbytagname

How to get the nesting level of DIVs?


$html ='<html>
<head>
    <title></title>
</head>
<body>
    <div class="">
        <div class="">
           <p><strong><span style="color:#FF0000"> Content1 </span></strong></p>
           <p style="text-align:center"> Content2 <img src="https://example.com/bla1.jpg"/></p>
        </div>
       
        <h2> Header </h2>
        <div class=""><p><strong> Content3 </strong></p> </div>

    </div>

    <div class=""> Content4 </div>
    <div class="">
                   <p> Content5 </p>  
                   <p> Content6 </p> 
                   <span> blah.. </span>
    </div>
</body></html>';

I need to have such an array:

That is, for each DIV (and the P elements inside it), I need to know whether it has a child DIV or a parent DIV.


Solution

  • Yours is a nice attempt, but I would rather get all p tags and then climb up the DOM hierarchy from each one, checking whether any ancestor is a div. This way you collect only those p nodes that have a div somewhere above them. Because the loop checks every ancestor and not just the direct parent, it behaves like the CSS descendant selector div p (not the child selector div > p).

    $ps = array();
    $doc = new DOMDocument('1.0', 'UTF-8');
    // PHP variable names are case-sensitive: the markup is in $html, not $HTML.
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Climb the ancestors until a div is found or the document node is reached.
        $curr_node = $p->parentNode;
        while ($curr_node instanceof DOMElement) {
            if ($curr_node->tagName == 'div') {
                $ps[] = $p;
                break;
            }
            $curr_node = $curr_node->parentNode;
        }
    }
    
    print_r($ps);
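
The same selection can probably be expressed more compactly with XPath. The following is only a sketch: the $html string is a shortened, hypothetical stand-in for the markup in the question, and it collects the text of each p instead of the node itself so the result is easy to read.

```php
<?php
// Shortened, hypothetical stand-in for the markup in the question.
$html = '<html><body>'
      . '<div><div><p>Content1</p><p>Content2</p></div><h2>Header</h2></div>'
      . '<p>Outside any div</p>'
      . '<div><p>Content5</p></div>'
      . '</body></html>';

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// "//p[ancestor::div]" matches every <p> that has at least one <div> ancestor,
// which is the condition the parent-climbing loop checks by hand.
$ps = [];
foreach ($xpath->query('//p[ancestor::div]') as $p) {
    $ps[] = trim($p->textContent);
}

print_r($ps);
```

The p that sits directly under body is skipped, because it has no div ancestor.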
    

    Update #1:

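    To get the ps grouped per div, you can recursively walk through all child nodes of each div, collect every p you find, and add that group to the result as shown below. Note that getElementsByTagName('div') also returns nested divs, so a p inside a nested div is collected once for each div that encloses it.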

    function getPs($node,&$result){
        foreach ($node->childNodes as $c_node) {
            if(property_exists($c_node, 'tagName') && $c_node->tagName == 'p'){
                $result[] = $c_node;
            }
            getPs($c_node,$result);
        }
    }
    
    $ps = [];
    
    foreach($doc->getElementsByTagName('div') as $div){
       $child_ps = [];
       getPs($div,$child_ps);
       if(count($child_ps) > 0) $ps[] = $child_ps;
    }
    
    echo "<pre>";
    print_r($ps);
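
If what you are after is the level of each div itself, one possible approach is to count its div ancestors. This is a minimal sketch under some assumptions: "level" is taken to mean the number of enclosing divs, divLevel() is a hypothetical helper name, and the id attributes are added here only to label the output (the markup in the question has none).

```php
<?php
// Hypothetical helper: counts how many <div> ancestors a node has.
function divLevel(DOMNode $node): int
{
    $level = 0;
    for ($n = $node->parentNode; $n !== null; $n = $n->parentNode) {
        if ($n instanceof DOMElement && $n->tagName === 'div') {
            $level++;
        }
    }
    return $level;
}

// Shortened stand-in for the question's markup; the ids are illustrative only.
$html = '<html><body>'
      . '<div id="a"><div id="b"><p>Content1</p></div></div>'
      . '<div id="c">Content4</div>'
      . '</body></html>';

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);

$levels = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    $levels[$div->getAttribute('id')] = [
        // 0 means the div has no parent div at all.
        'level'         => divLevel($div),
        // getElementsByTagName() on an element searches its descendants only.
        'has_child_div' => $div->getElementsByTagName('div')->length > 0,
    ];
}

print_r($levels);
```

A level greater than 0 tells you the div has a parent div, and has_child_div tells you whether it contains another div.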
    

    Update #2:

    To get the HTML string representation of the p node, change

    $result[] = $c_node;
    

    to

    $result[] = $c_node->ownerDocument->saveXML( $c_node );
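
Since saveXML() produces an XML serialisation, you may prefer saveHTML(), which also accepts a node, if the string is going back into an HTML page. A small sketch for comparing the two, using one paragraph loosely based on the question's markup:

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(
    '<html><body><div>'
    . '<p>Content2 <img src="https://example.com/bla1.jpg"></p>'
    . '</div></body></html>'
);

$p = $doc->getElementsByTagName('p')->item(0);

// XML serialisation of the node: empty elements such as <img> are
// typically written in self-closing form.
$asXml = $doc->saveXML($p);

// HTML serialisation of the same node.
$asHtml = $doc->saveHTML($p);

echo $asXml, "\n", $asHtml, "\n";
```

Both calls return the outer markup of the p, including the p tag itself.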