How can I scrape a webpage that uses infinite scrolling (like imgur) in a loop (page 1, page 2, etc.)?
I tried the code below, but it returns only the first page. How can I trigger loading of the next page when the site uses an infinite-scrolling template?
<?php
// Follows redirects manually when open_basedir/safe_mode prevent
// CURLOPT_FOLLOWLOCATION from being used.
function curl_exec_follow($ch, &$maxredirect = null) {
    $mr = $maxredirect === null ? 10 : intval($maxredirect);
    if (ini_get('open_basedir') == '' && ini_get('safe_mode') == 'Off') {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
        curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    } else {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
        if ($mr > 0) {
            $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $newurl = $original_url;
            $rch = curl_copy_handle($ch);
            curl_setopt($rch, CURLOPT_HEADER, true);
            curl_setopt($rch, CURLOPT_NOBODY, true);
            curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
            do {
                curl_setopt($rch, CURLOPT_URL, $newurl);
                $header = curl_exec($rch);
                if (curl_errno($rch)) {
                    $code = 0;
                } else {
                    $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
                    if ($code == 301 || $code == 302) {
                        preg_match('/Location:(.*?)\n/', $header, $matches);
                        $newurl = trim(array_pop($matches));
                        // if no scheme is present then the new url is a
                        // relative path and thus needs some extra care
                        if (!preg_match("/^https?:/i", $newurl)) {
                            $newurl = $original_url . $newurl;
                        }
                    } else {
                        $code = 0;
                    }
                }
            } while ($code && --$mr);
            curl_close($rch);
            if (!$mr) {
                if ($maxredirect === null) {
                    trigger_error('Too many redirects.', E_USER_WARNING);
                } else {
                    $maxredirect = 0;
                }
                return false;
            }
            curl_setopt($ch, CURLOPT_URL, $newurl);
        }
    }
    return curl_exec($ch);
}

$ch = curl_init('http://www.imgur.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec_follow($ch);
curl_close($ch);
echo $data;
?>
cURL works by getting the source code of a webpage. Your code will gather the HTML only from the original webpage. In the case of imgur, it will include ~40 images, plus the rest of the page layout.
This original source code doesn't change when you scroll down. However, the HTML inside of your browser does. This is done with AJAX. The page that you are looking at requests information from a second page.
If you use Firebug (for Firefox) or Google Chrome's page inspector, you can monitor these requests by going to the Net or Network tab (respectively). When you scroll down, the page makes another ~45 requests or so (mostly for images). You'll also see that it requests this page:
http://imgur.com/gallery/hot/viral/day/page/0?scrolled&set=1
The JavaScript on the imgur homepage appends this HTML to the bottom of the home page. You would probably want to query this page (or the API, as the other Chris said) if you want to get a list of images. You can play with the numbers at the end of the URL to get more images.
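To turn that into a page-by-page loop, you can request that AJAX endpoint directly with incrementing page numbers. A minimal sketch (the URL pattern is the one observed above in the Network tab; imgur may change it at any time, and the page count of 5 is an arbitrary choice for the demo):

```php
<?php
// Build the URL of the AJAX endpoint for a given zero-based page index.
// Pattern observed in the browser's Network tab; subject to change.
function imgur_page_url($page) {
    return 'http://imgur.com/gallery/hot/viral/day/page/' . intval($page) . '?scrolled&set=1';
}

// Fetch one page of HTML, following redirects.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Loop over the first few "scroll" pages, stopping when a request
// fails or comes back empty (i.e. no more content to load).
for ($page = 0; $page < 5; $page++) {
    $html = fetch_page(imgur_page_url($page));
    if ($html === false || trim($html) === '') {
        break;
    }
    echo 'page ' . $page . ': ' . strlen($html) . " bytes\n";
    sleep(1); // be polite to the server between requests
}
?>
```

Each response is the same HTML fragment the site's JavaScript would have appended to the bottom of the page, so you can parse it for image links just like the initial page source.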