php, facebook, web-scraping

Data Scraping Problem


I am scraping wall posts from a Facebook page. Here is the URL:

http://www.facebook.com/GMHTheBook?v=wall&ref=ts#!/GMHTheBook?v=wall&ref=ts

I successfully scraped all the visible wall posts using cURL.

Problem:

At the end of the visible wall posts there is an "Older Posts" link, which shows more wall posts when you click it. How can I programmatically "click" that link to load the additional wall posts and scrape them as well?

Is there a way to do this with any method? I am using cURL, but I am open to any solution that handles this situation.

Update:

Now I am using this code to fetch the data, find the next link, fetch the data for that URL, and so on. Here is the code:

ini_set('display_errors', true);
error_reporting(E_ALL);

$data = json_decode(file_get_contents(($url)), true);

$names = array();
$stories = array();

foreach($data['data'] as $post)
{
    $names[] = $post['from']['name'];
    $stories[] = $post['message'];
}

$url = $data['paging']['next'];

// this is meant to scrape data repeatedly from the next links
while($url !== '')
{
    $url = $data['paging']['next'];
    $data = json_decode(file_get_contents(($url)), true);

    foreach($data['data'] as $post)
    {
        $names[] = $post['from']['name'];
        $stories[] = $post['message'];
    }

    $url = urldecode($data['paging']['next']);
    echo $url . '<br />';
}


for($j = 0; $j < count($names); $j++)
{
  $data .= $names[$j] . '|' . $stories[$j] . "\n";
}

$h = fopen("data.txt", "a+");
fwrite($h, $data);
fclose($h);

The problem is that the script keeps running with no output at all, and no file is created. I have raised the script's time limit, and allow_url_fopen is set to On. Is there anything wrong in the script, or am I not following the next links in the right way? Is there a solution or an alternative to this?


Solution

  • You should use the Graph API. The data you are scraping is available from it in JSON format, and each response contains links for getting the previous/next pages under its paging key.

    Example:

    $data = json_decode(file_get_contents(($url)));
    foreach($data->data as $post) {
        echo $post->from->name, ': ',
             $post->message,
             PHP_EOL;
    }
    

    The above will output the posts contained in the returned page of the wall. For paging, use:

    echo $data->paging->previous;
    echo $data->paging->next;
    

    This will output two URLs. To get the adjacent pages, fetch and decode those URLs the same way.
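
Following the next links until there are no more posts could be sketched roughly as below. This is only a sketch under assumptions, not a definitive implementation: collectPosts, its $fetch parameter, and the $maxPages cap are hypothetical names introduced here for illustration. It assumes each response has the data and paging.next keys shown above, that a post may lack a message, and that the loop should stop on an empty page, a missing next link, a repeated URL, or a failed request. It also keeps the output rows in their own array instead of appending to $data, which holds the decoded response.

```php
<?php
/**
 * Collect "name|message" rows from every page of a paginated feed.
 *
 * $fetch is a hypothetical callable that takes a URL and returns the raw
 * JSON string, or false on failure. 'file_get_contents' fits that shape.
 * $maxPages is an assumed safety cap so the loop can never run forever.
 */
function collectPosts(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $rows = array();
    $seen = array();
    $url = $startUrl;

    for ($page = 0; $page < $maxPages && $url !== null; $page++) {
        // Stop if we are handed a URL that was already visited.
        if (isset($seen[$url])) {
            break;
        }
        $seen[$url] = true;

        $json = $fetch($url);
        $data = ($json === false) ? null : json_decode($json, true);

        // Stop on a failed request, invalid JSON, or a page with no posts.
        if (!is_array($data) || empty($data['data'])) {
            break;
        }

        foreach ($data['data'] as $post) {
            // Default to '' when a field is absent, to avoid undefined-index notices.
            $name = isset($post['from']['name']) ? $post['from']['name'] : '';
            $message = isset($post['message']) ? $post['message'] : '';
            $rows[] = $name . '|' . $message;
        }

        // Use the next link exactly as returned; a missing key ends the loop.
        $url = isset($data['paging']['next']) ? $data['paging']['next'] : null;
    }

    return $rows;
}
```

With the same $url as in the examples above, it could be called as collectPosts($url, 'file_get_contents'), and the result written out with file_put_contents('data.txt', implode("\n", $rows) . "\n", FILE_APPEND). Passing the fetcher in as a callable is a design choice that lets the loop be tried against canned JSON before pointing it at the live API.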