Extracting useful/readable content from a website

I am working on an application that needs to scrape part of a website the user submits. I want to collect the useful, readable content from the page, and definitely not the whole site. Looking at applications that already do this (thinkery, for example), I notice that they have somehow managed to scrape the page, guess which content is useful, and show it in a readable format, and they do it pretty fast.

I've been playing with cURL and am getting close to the result I want, but I have some issues and was wondering whether anyone has more insight.

    $ch = curl_init('');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // $content contains the whole page
    $content = curl_exec($ch);


With the very simple code above I can scrape the whole page, and with preg_match() I can try to find divs whose class, id, or other attributes contain the string 'content', 'summary', et cetera.
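To make the preg_match() idea concrete, here is a minimal sketch. The function name `findCandidateDivs` and the exact keyword list are my own assumptions for illustration, and it only finds the opening `<div ...>` tags, not the full element contents (regular expressions cannot reliably match nested HTML).

```php
<?php
// Hypothetical helper: return the opening <div> tags whose id, class or
// itemprop attribute contains one of a few "content-like" keywords.
function findCandidateDivs(string $html): array
{
    // (["\']) captures the opening quote so \1 can require the same closing quote.
    $pattern = '/<div\b[^>]*\b(?:id|class|itemprop)\s*=\s*(["\'])'
             . '[^"\']*(?:content|summary|description)[^"\']*\1[^>]*>/i';

    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full text of every matched opening tag.
    return $matches[0];
}
```

Calling `findCandidateDivs($content)` on the cURL result would give a list of candidate tags to inspect further.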

If preg_match() returns a match, I can reasonably guess that I have found relevant content and save it as the summary of the saved page. The problem is that cURL holds the WHOLE page in memory, which can take a lot of time and resources. I also suspect that running preg_match() over such a large result takes a lot of time.
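One possible way around the memory concern is to stop the download after a fixed number of bytes using cURL's CURLOPT_WRITEFUNCTION callback. This is only a sketch: `makeCappedWriter`, `fetchCapped`, and the 64 KB default are names and values I made up, and it assumes the interesting content sits near the top of the page.

```php
<?php
// Hypothetical helper: build a write callback that keeps at most $maxBytes.
function makeCappedWriter(string &$buffer, int $maxBytes): callable
{
    return function ($ch, string $chunk) use (&$buffer, $maxBytes): int {
        $remaining = $maxBytes - strlen($buffer);
        $buffer .= substr($chunk, 0, max(0, $remaining));

        // cURL aborts the transfer when the callback returns a byte count
        // different from the length of the chunk it was given.
        return strlen($buffer) >= $maxBytes ? 0 : strlen($chunk);
    };
}

// Hypothetical helper: fetch only the first $maxBytes of $url.
function fetchCapped(string $url, int $maxBytes = 65536): string
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeCappedWriter($buffer, $maxBytes));

    // curl_exec() returns false when the callback aborts on purpose,
    // so the return value is deliberately ignored here.
    curl_exec($ch);

    return $buffer;
}
```

The markup returned this way is usually cut off mid-element, so whatever parses it afterwards has to tolerate incomplete HTML.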

Is there a better way to achieve this?


  • I tried DOMDocument::loadHTMLFile as One Trick Pony suggested (thanks!):

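        $ch = curl_init('');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);

        // Real-world HTML is rarely valid, so collect libxml warnings
        // instead of printing them.
        libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        // The markup is already in $content, so loadHTML() is used here.
        $doc->loadHTML($content);

        foreach ($doc->getElementsByTagName('div') as $div_element) {
            if ($div_element->getAttribute('itemprop') == 'description') {
                // Assumed body: print the text of the matching div
                echo trim($div_element->textContent);
            }
        }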

    The result of the code above is my question here on this page! The only thing left to do is find a good, consistent way to loop through or query the divs and determine whether each one is useful content.
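For the remaining step, one option is to query the divs with DOMXPath instead of looping over all of them. The sketch below rests on an assumed heuristic (the matching div with the most text wins), and the function name `extractSummary` and the keyword list are hypothetical, so treat it as a starting point rather than a reliable content detector.

```php
<?php
// Hypothetical helper: return the text of the most "content-like" div,
// or null when nothing matches.
function extractSummary(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string on PHP 8
    }

    // Suppress warnings about invalid markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        "//div[@itemprop='description'"
        . " or contains(@class, 'content') or contains(@id, 'content')"
        . " or contains(@class, 'summary') or contains(@id, 'summary')]"
    );

    // Assumed heuristic: keep the candidate with the longest text.
    $best = null;
    foreach ($nodes as $node) {
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($best === null || strlen($text) > strlen($best)) {
            $best = $text;
        }
    }

    return $best;
}
```

With the cURL result in `$content`, `extractSummary($content)` would return a single whitespace-normalized string to store as the summary.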