I am working on an application that needs to scrape part of a website the user submits. I want to collect the useful, readable content from the page, and definitely not the whole site. When I look at applications that already do this (thinkery, for example), I notice that they somehow manage to scrape the page, guess what the useful content is, and show it in a readable format, and they do it pretty fast.
I've been playing with cURL and I'm getting close to the result I want, but I have some issues and was wondering if anyone has more insight.
$ch = curl_init('http://www.example.org');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $content contains the full HTML of the requested page
$content = curl_exec($ch);
curl_close($ch);
With the very simple code above I can fetch the whole page, and with preg_match() I can try to find divs whose class, id or other attributes contain the string 'content', 'summary', et cetera.
If preg_match() returns a match, I can reasonably guess that I have found relevant content and save it as the summary of the saved page. The problem is that cURL holds the WHOLE page in memory, which can take a lot of time and resources. I also suspect that running preg_match() over such a large string can be slow.
Is there a better way to achieve this?
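One thing I am considering for the memory concern is capping how much of the page cURL downloads. Below is a minimal sketch of that idea, not a tested solution. The helper names makeCappedWriter() and fetchCapped() and the 200000-byte and 10-second limits are my own assumptions; CURLOPT_WRITEFUNCTION, CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT are regular cURL options. The obvious risk is that the cap cuts the page off before the content I am after.

```php
<?php
// Hypothetical helper (not part of cURL): builds a write callback that
// appends incoming chunks to $buffer and aborts once $maxBytes is reached.
function makeCappedWriter(string &$buffer, int $maxBytes): callable
{
    return function ($ch, string $chunk) use (&$buffer, $maxBytes): int {
        $remaining = $maxBytes - strlen($buffer);
        // Keep only as much of this chunk as still fits under the cap.
        $buffer .= substr($chunk, 0, max(0, $remaining));
        // cURL aborts the transfer when the returned count differs from
        // the chunk length, so returning 0 stops the download.
        return strlen($chunk) <= $remaining ? strlen($chunk) : 0;
    };
}

// Hypothetical wrapper: fetches at most $maxBytes of the page body.
function fetchCapped(string $url, int $maxBytes = 200000, int $timeoutSeconds = 10): string
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeCappedWriter($buffer, $maxBytes));
    // curl_exec() returns false when the cap aborts the transfer,
    // but $buffer still holds everything received up to that point.
    curl_exec($ch);
    return $buffer;
}
```

Calling fetchCapped('http://www.example.org') would then return at most the first 200000 bytes of the response body, which could be handed to whatever does the content detection.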
I tried DOMDocument::loadHTMLFile as One Trick Pony suggested (thanks!). Here I fetch the page with cURL and pass the result to loadHTML():
$ch = curl_init('http://stackoverflow.com/questions/17180043/extracting-useful-readable-content-from-a-website');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
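$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect libxml warnings instead of silencing them with @
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();
$div_elements = $doc->getElementsByTagName('div');
if ($div_elements->length !== 0)
{
    foreach ($div_elements as $div_element)
    {
        // On this page the question body is marked with itemprop="description"
        if ($div_element->getAttribute('itemprop') === 'description')
        {
            var_dump($div_element->nodeValue);
        }
    }
}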
The result of the code above is my question text from this very page! The only thing left to do is find a good, consistent way to loop through or query the divs and determine whether they hold useful content.
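For querying the divs, one direction I am looking at is DOMXPath instead of looping over every div by hand. The following is only a rough sketch under my own assumptions: the function name findContentCandidates() and the keyword list are made up for illustration, the keywords are assumed to be plain words without quotes, and "longest text first" is just a naive guess at relevance. Note that XPath contains() is case-sensitive and that nested matches will each show up as a separate candidate.

```php
<?php
// Hypothetical helper: returns the text of divs whose class or id contains
// one of the keywords (plus any itemprop="description" element),
// ordered from longest to shortest text.
function findContentCandidates(string $html, array $keywords = ['content', 'summary', 'article']): array
{
    if (trim($html) === '' || $keywords === []) {
        return [];
    }

    // Collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build: contains(@class, "content") or contains(@id, "content") or ...
    $conditions = [];
    foreach ($keywords as $keyword) {
        $conditions[] = sprintf('contains(@class, "%1$s") or contains(@id, "%1$s")', $keyword);
    }

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//div[' . implode(' or ', $conditions) . '] | //*[@itemprop="description"]'
    );
    if ($nodes === false) {
        return [];
    }

    $candidates = [];
    foreach ($nodes as $node) {
        // Collapse runs of whitespace so the text is readable.
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '') {
            $candidates[] = $text;
        }
    }

    // Naive heuristic: treat the longest block of text as the best candidate.
    usort($candidates, fn($a, $b) => strlen($b) <=> strlen($a));
    return $candidates;
}
```

With the $content fetched by cURL above, findContentCandidates($content)[0] ?? null would give the longest matching block, which could serve as a first guess for the summary.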