I'd like to get all articles from a webpage, as well as all the pictures for each article.
I decided to use the PHP Simple HTML DOM Parser and wrote the following code:
<?php
include("simple_html_dom.php");

$sitesToCheck = array(
    array(
        'url' => 'http://googleblog.blogspot.ru/',
        'search_element' => 'h2.title a',
        'get_element' => 'div.post-content'
    ),
    array(
        // 'url' => '',            // Site address with a list of articles
        // 'search_element' => '', // Selector for the article links on that site
        // 'get_element' => ''     // Selector for the desired content
    )
);

$s = microtime(true);

foreach ($sitesToCheck as $site)
{
    $html = file_get_html($site['url']);

    foreach ($html->find($site['search_element']) as $link)
    {
        $content  = '';
        $savePath = 'cachedPages/' . md5($site['url']) . '/';
        $fileName = md5($link->href);

        if ( ! file_exists($savePath . $fileName))
        {
            $post_for_scan = file_get_html($link->href);

            foreach ($post_for_scan->find($site["get_element"]) as $element)
            {
                $content .= $element->plaintext . PHP_EOL;
            }

            // Create the cache directory (readable/writable) if it does not exist yet
            if ( ! file_exists($savePath) && ! mkdir($savePath, 0755, true))
            {
                die('Unable to create directory ...');
            }

            file_put_contents($savePath . $fileName, $content);
        }
    }
}

$e = microtime(true);
echo $e - $s;
For now I am only trying to fetch the articles, without the pictures, but the script fails with
"Maximum execution time of 120 seconds exceeded".
What am I doing wrong? Is there another way to get all the articles, and all the pictures for each article, from a specific webpage?
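The error itself comes from PHP's max_execution_time limit; every article is fetched sequentially with file_get_html(), so many slow HTTP requests can easily add up past 120 seconds. A minimal sketch of raising the script limit and bounding each individual fetch, assuming the stream-based HTTP wrapper is what performs the requests (the 10-second value is only illustrative):

// Raise (or remove) the script-level limit; 0 means no limit.
set_time_limit(0);

// Bound each HTTP request instead, so one slow site cannot stall the whole run.
$context = stream_context_create(array(
    'http' => array('timeout' => 10) // give up on any single page after 10 seconds
));

$raw  = file_get_contents($site['url'], false, $context);
$html = str_get_html($raw); // simple_html_dom helper: parse an already-fetched string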
I had similar problems with that lib. Use PHP's DOMDocument instead:
$doc = new DOMDocument;
$doc->loadHTML($html); // $html holds the raw page markup, e.g. from file_get_contents()

$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    doSomethingWith($link->getAttribute('href'), $link->nodeValue);
}
See http://www.php.net/manual/en/domdocument.getelementsbytagname.php
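getElementsByTagName() only matches by tag name; the class-based selectors from the question (h2.title a, div.post-content) and the picture extraction map naturally to DOMXPath. A rough sketch along those lines, with the URL and selectors taken from the question and everything else only illustrative:

$doc = new DOMDocument;
libxml_use_internal_errors(true); // real-world blog HTML is rarely valid; suppress parser warnings
$doc->loadHTML(file_get_contents('http://googleblog.blogspot.ru/'));
$xpath = new DOMXPath($doc);

// Equivalent of the CSS selector "h2.title a"
$links = $xpath->query("//h2[contains(concat(' ', normalize-space(@class), ' '), ' title ')]//a");

foreach ($links as $link) {
    $articleDoc = new DOMDocument;
    $articleDoc->loadHTML(file_get_contents($link->getAttribute('href')));
    $articleXpath = new DOMXPath($articleDoc);

    // Equivalent of "div.post-content"
    foreach ($articleXpath->query("//div[contains(concat(' ', normalize-space(@class), ' '), ' post-content ')]") as $content) {
        echo $content->textContent, PHP_EOL;

        // Pictures inside the article body
        foreach ($content->getElementsByTagName('img') as $img) {
            echo $img->getAttribute('src'), PHP_EOL;
        }
    }
}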