php html content-extraction

Strip and fetch text content from each nested div on a page


I have fetched the HTML of a page from a URL. What I want is to extract only the plain text content inside each div. Can this be done? The structure will be similar to this:

<div class="first">
  <div class="second">
     Some content inside second div
    <div class="third">
      Some more content inside third div
    </div>
  </div>
</div>

When I extract the content, I want the plain text in an array keyed by each div's class, something like this:

Array(
 [first]=>
 [second]=>Some content inside second div
 [third]=>Some more content inside third div
);

I am trying to do this with strip_tags, but I am not sure how to split the result up and add it to an array. Any help would be appreciated.


Solution

  • <?php
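    // Trim each line and keep only the non-empty ones.
    function clearArray($arr) {
        if (!is_array($arr)) {
            return false;
        }
        $newArray = []; // initialise so an all-blank input returns [] instead of an undefined variable
        foreach ($arr as $element) {
            $cont = trim($element); // strips surrounding whitespace, including CR/LF line endings
            if ($cont !== '') { // unlike empty(), this keeps a line that is just "0"
                $newArray[] = $cont;
            }
        }
        return $newArray;
    }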
    $content='<div class="first">
      <div class="second">
         Some content inside second div
        <div class="third">
          Some more content inside third div
        </div>
      </div>
    </div>';
    $strippedContent=strip_tags($content);
    $content=explode("\n", $strippedContent);
    $content=clearArray($content);
    print_r($content);
    

    This outputs:

    Array
    (
        [0] => Some content inside second div
        [1] => Some more content inside third div
    )
    

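    If you are retrieving this from an external page, I'd strongly recommend using DOMDocument and XPath to select the elements instead. That also lets you key the result by class name, which strip_tags cannot do because it discards the tags.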
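
The DOMDocument and XPath suggestion could look roughly like the sketch below. This is not code from the answer above: the helper name `divTextByClass` is hypothetical, and it assumes every div of interest has a single, unique class (a repeated class would overwrite the earlier entry) and that the markup is ASCII or otherwise safe for `loadHTML`'s default encoding handling. It collects only the text nodes that are direct children of each div, so nested divs do not leak their text into the parent.

```php
<?php
// Hypothetical helper (an assumption, not part of the original answer):
// maps each div's class attribute to the text directly inside that div.
function divTextByClass(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    // Visit every div that carries a class attribute, in document order.
    foreach ($xpath->query('//div[@class]') as $div) {
        $text = '';
        // Use only the div's own text nodes, skipping child elements.
        foreach ($div->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue;
            }
        }
        $result[$div->getAttribute('class')] = trim($text);
    }
    return $result;
}

$html = '<div class="first">
  <div class="second">
     Some content inside second div
    <div class="third">
      Some more content inside third div
    </div>
  </div>
</div>';

$result = divTextByClass($html);
print_r($result);
```

For the sample markup from the question, `first` maps to an empty string, `second` to "Some content inside second div", and `third` to "Some more content inside third div", which matches the array shape asked for.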