Tags: php, dom, domdocument, domxpath

Extract content of specific div preserving only certain elements


I need to extract only the textual part of a webpage, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.

Using DOMXPath with $div = $xpath->query('//div[@class="story-inner"]'); returns many unwanted page elements inside the text div, such as pictures, ad blocks and other custom markup.

On the other hand, using the following code:

$items = $doc->getElementsByTagName('p'); // tag name only, without angle brackets
for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "<p>";
}

gives a clean result very close to what I want, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.

Is there a DOM way of either (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained with $div = $xpath->query('//div[@class="story-inner"]');?


Solution

  • You could use an OR inside your XPath query in this case. Just chain the tag names you want with "or" so that only those elements are selected.

    $url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $html = curl_exec($curl);
    curl_close($curl);
    
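    $doc = new DOMDocument();
    // Convert UTF-8 text to HTML entities so loadHTML() does not mangle it.
    // Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
    // @ suppresses the warnings libxml emits for malformed real-world markup.
    @$doc->loadHTML($html);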
    
    $xpath = new DOMXPath($doc);
    
    // List every element type to keep; the question asks for these five.
    $tags = array('p', 'h2', 'h3', 'h4', 'blockquote');
    $children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
    $query = "//div[@class='story-body__inner']//*[$children_needed]";
    $div_children = $xpath->query($query);
    if($div_children->length > 0) {
        foreach($div_children as $child) {
            echo $doc->saveHTML($child);
        }
    }
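
To try the idea without fetching a live page, here is a minimal sketch that runs the same kind of query against an inline HTML string. The sample markup, the variable names, and the use of the self:: axis in place of name() are assumptions made for illustration, not part of the answer above. Both forms select the same elements in this sample.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<HTML
<html><body>
<div class="story-body__inner">
  <h2>Headline</h2>
  <p>First paragraph.</p>
  <img src="pic.jpg" alt="">
  <blockquote>A quote.</blockquote>
  <div class="ad">Advertisement</div>
  <h3>Subheading</h3>
  <p>Second paragraph.</p>
</div>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Build a predicate such as: self::p or self::h2 or ...
$tags = array('p', 'h2', 'h3', 'h4', 'blockquote');
$predicate = implode(' or ', array_map(function ($tag) {
    return "self::$tag";
}, $tags));
$query = "//div[@class='story-body__inner']//*[$predicate]";

// XPath returns matches in document order, so the text keeps its flow.
$names = array();
$fragments = array();
foreach ($xpath->query($query) as $node) {
    $names[] = $node->nodeName;
    $fragments[] = $doc->saveHTML($node);
}

echo implode("\n", $fragments), "\n";
```

The image and the ad block are skipped because they match none of the listed names. One caveat: the descendant axis (//*) also matches nested elements, so a <p> inside a <blockquote> would be printed twice, once on its own and once as part of its parent.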