php, curl, web, screen-scraping

Extracting useful/readable content from a website


I am working on an application that needs to scrape part of a website the user submits. I want to collect the useful, readable content from the page, and definitely not the whole site. When I look at applications that already do this (thinkery, for example), I notice that they have somehow managed to scrape the website, guess which content is useful, and show it in a readable format, and they do it pretty fast.

I've been playing with cURL and I am getting pretty close to the result I want, but I have some issues and was wondering if someone has more insight.

    $ch = curl_init('http://www.example.org');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // $content contains the whole website
    $content = curl_exec($ch);

    curl_close($ch);

With the very simple code above I can fetch the whole page, and with preg_match() I can try to find divs whose class, id or other attributes contain the string 'content', 'summary', et cetera.

If preg_match() returns a match, I can reasonably guess that I have found relevant content and save it as the summary of the saved page. The problem is that cURL loads the WHOLE page into memory, which can take up a lot of time and resources. I also suspect that running preg_match() over such a large result can take a lot of time.

Is there a better way to achieve this?
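One way to bound the memory concern described above is to cap how many bytes cURL is allowed to deliver. The following is only a minimal sketch: `makeCappedWriter()` and `fetchCapped()` are hypothetical helper names, and the 200000-byte default is an arbitrary assumption, not a recommended value. It relies on `CURLOPT_WRITEFUNCTION`, whose callback aborts the transfer when it returns a number different from the length of the chunk it was given.

```php
<?php
// Hypothetical helper: builds a cURL write callback that keeps at most
// $maxBytes of the response in $buffer and then aborts the transfer.
function makeCappedWriter(string &$buffer, int $maxBytes): callable
{
    return function ($ch, string $chunk) use (&$buffer, $maxBytes): int {
        $remaining = $maxBytes - strlen($buffer);
        // Keep only as much of this chunk as still fits under the cap.
        $buffer .= substr($chunk, 0, max(0, $remaining));
        // Returning a value other than strlen($chunk) makes cURL abort.
        return strlen($chunk) <= $remaining ? strlen($chunk) : 0;
    };
}

// Hypothetical helper: downloads at most $maxBytes of $url.
function fetchCapped(string $url, int $maxBytes = 200000): string
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeCappedWriter($buffer, $maxBytes));
    // curl_exec() returns false when the cap aborts the transfer;
    // $buffer still holds everything received up to that point.
    curl_exec($ch);
    curl_close($ch);
    return $buffer;
}
```

The trade-off is that the returned HTML may be cut off in the middle of an element, so whatever parses it afterwards has to tolerate truncated markup, and content that appears late in the page is lost.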


Solution

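  • I tried DOMDocument, as One Trick Pony suggested (thanks!). The suggestion was DOMDocument::loadHTMLFile; the code below fetches the page with cURL and passes the result to loadHTML instead.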

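        $ch = curl_init('http://stackoverflow.com/questions/17180043/extracting-useful-readable-content-from-a-website');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        if ($content === false)
        {
            // Stop early: there is nothing to parse if the request failed
            exit('Request failed: ' . curl_error($ch));
        }
        curl_close($ch);

        // Collect HTML parse warnings internally instead of silencing them with @
        libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($content);
        libxml_clear_errors();

        // Print the text of every div marked with itemprop="description"
        foreach ($doc->getElementsByTagName('div') as $div_element)
        {
            if ($div_element->getAttribute('itemprop') === 'description')
            {
                var_dump($div_element->nodeValue);
            }
        }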
    

    The result of the code above is my question here on this page! The only thing left to do is to find a good, consistent way to loop through or query the divs and decide whether each one holds useful content.
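One possible way to query the divs instead of looping over all of them is an XPath expression evaluated with `DOMXPath`. This is a rough sketch rather than a finished heuristic: `findContentCandidates()` is a hypothetical helper, and the hint words ('content', 'summary', 'description') are assumptions taken from the question, so they will not fit every site. Hints are assumed to be plain words without quotes, and the matching is case-sensitive.

```php
<?php
// Hypothetical helper: returns the text of every div whose class or id
// contains one of the hint words, or whose itemprop equals one of them.
function findContentCandidates(string $html, array $hints = ['content', 'summary', 'description']): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    // Real-world HTML is rarely valid, so keep parse warnings internal.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build one XPath predicate out of all hint words.
    $tests = [];
    foreach ($hints as $hint) {
        $tests[] = "contains(@class, '$hint')";
        $tests[] = "contains(@id, '$hint')";
        $tests[] = "@itemprop = '$hint'";
    }
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[' . implode(' or ', $tests) . ']');

    $results = [];
    foreach ($nodes as $node) {
        // Collapse runs of whitespace so the text is readable.
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '') {
            $results[] = $text;
        }
    }
    return $results;
}
```

Calling `findContentCandidates($content)` on the cURL result returns an array of candidate text blocks in document order. When matching divs are nested, the outer div's text includes the inner one's, so some de-duplication or ranking (by text length, for instance) would still be needed on top of this.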