I am scraping the wall posts from a Facebook page. Here is the URL:
http://www.facebook.com/GMHTheBook?v=wall&ref=ts#!/GMHTheBook?v=wall&ref=ts
I successfully scraped all the visible wall posts using cURL.
Problem:
At the end of the visible wall posts there is an "Older Posts" link, which shows more wall posts once you click it. How do I simulate clicking that link so that I can load and scrape those posts as well?
Is there a solution using any method? I am using cURL, but I am open to any approach that handles this situation.
Now I am using this code to get all the data, find the next link, fetch the data for that URL, and so on. Here is the code:
ini_set('display_errors', true);
error_reporting(E_ALL);
$data = json_decode(file_get_contents(($url)), true);
$names = array();
$stories = array();
foreach($data['data'] as $post)
{
    $names[] = $post['from']['name'];
    $stories[] = $post['message'];
}
$url = $data['paging']['next'];
// this is meant to scrape data recursively from the next links
while($url !== '')
{
    $url = $data['paging']['next'];
    $data = json_decode(file_get_contents(($url)), true);
    foreach($data['data'] as $post)
    {
        $names[] = $post['from']['name'];
        $stories[] = $post['message'];
    }
    $url = urldecode($data['paging']['next']);
    echo $url . '<br />';
}
for($j = 0; $j < count($names); $j++)
{
    $data .= $names[$j] . '|' . $stories[$j] . "\n";
}
$h = fopen("data.txt", "a+");
fwrite($h, $data);
fclose($h);
The problem is that the script keeps running with no output at all, and no file is created. I have already raised the script's time limit, and allow_url_fopen is set to on. Is there anything wrong in the script, or am I not doing the recursion the right way? Is there any solution or alternative to this?
You should use the Graph API. The data you are scraping is available there in JSON format, and the response contains links for getting the previous/next pages under its paging key.
Example:
$data = json_decode(file_get_contents(($url)));
foreach($data->data as $post) {
    echo $post->from->name, ': ',
         $post->message,
         PHP_EOL;
}
The above will output all the posts contained in the response. For paging, use:
echo $data->paging->previous;
echo $data->paging->next;
This will output two URLs. All you have to do is load them again.
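
Following the next link repeatedly could look roughly like the sketch below. It is only an illustration, not a tested client for the real API: it assumes PHP 7 or later, the function name collectPosts, the $fetch callable and the $maxPages cap are made up for this example, and it assumes the feed may keep returning a next link even when a page has no posts, so it stops on an empty page. It also collects plain strings instead of appending to $data, which holds the decoded response.

```php
<?php
/**
 * Collect "name|message" lines by following paging.next links.
 *
 * $fetch is a hypothetical callable: it takes a URL and returns the raw
 * JSON body, or false on failure (file_get_contents has that shape).
 * $maxPages is an arbitrary safety cap chosen for this sketch.
 */
function collectPosts(string $url, callable $fetch, int $maxPages = 50): array
{
    $lines = [];
    $seen = [];

    for ($page = 0; $page < $maxPages && $url !== ''; $page++) {
        // Stop if a URL repeats, otherwise the loop could cycle.
        if (isset($seen[$url])) {
            break;
        }
        $seen[$url] = true;

        $body = $fetch($url);
        $data = ($body === false) ? null : json_decode($body, true);

        // Stop on a failed request, invalid JSON, or a page with no posts.
        if (!is_array($data) || empty($data['data'])) {
            break;
        }

        foreach ($data['data'] as $post) {
            // Not every post is guaranteed to have these keys.
            $name = $post['from']['name'] ?? '';
            $message = $post['message'] ?? '';
            $lines[] = $name . '|' . $message;
        }

        // An absent next link ends the loop via the $url !== '' check.
        $url = $data['paging']['next'] ?? '';
    }

    return $lines;
}

// Possible usage ($startUrl stands for the first page's URL):
// $lines = collectPosts($startUrl, 'file_get_contents');
// file_put_contents('data.txt', implode("\n", $lines) . "\n", FILE_APPEND);
```

Passing the fetcher in as a callable is just a convenience, so the loop can be tried against canned JSON before pointing it at live URLs.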