php, web-scraping, querypath

Why does my PHP QueryPath 2.1.2 scraping script on WAMP return only 5 articles instead of 43? Timeout?


I am trying to scrape 43 blog posts from my blog and store them in an array, but when I print_r the array, only the first 5 entries have content [the rest are empty] instead of all 43. Why? And how can I get all 43? I run this script from cmd.exe [command line] on WAMP.

    <?php

    require 'src/QueryPath/QueryPath.php';

    // Load the four listing pages of the blog.
    $qp1 = htmlqp('http://myblog.com/blog');
    $qp2 = htmlqp('http://myblog.com/blog/Page-2.html');
    $qp3 = htmlqp('http://myblog.com/blog/Page-3.html');
    $qp4 = htmlqp('http://myblog.com/blog/Page-4.html');

    // Collect the href of every article link found on each listing page.
    $links = array();

    foreach ($qp1->find('ol>li a[href],.jbReadon') as $item) {
        $links[] = $item->attr('href');
    }

    foreach ($qp2->find('ol>li a[href],.jbReadon') as $item) {
        $links[] = $item->attr('href');
    }

    foreach ($qp3->find('ol>li a[href],.jbReadon') as $item) {
        $links[] = $item->attr('href');
    }

    foreach ($qp4->find('ol>li a[href],.jbReadon') as $item) {
        $links[] = $item->attr('href');
    }

    print_r($links);

    // Fetch each article and keep the text of its intro paragraphs.
    $content = array();

    foreach ($links as $link) {
        $url = "http://myblog.com".$link;

        $content[] = htmlqp($url)->find('.jbIntroText p')->text();
    }
    print_r($content);

    ?>

From key 5 of the array onwards, all the values are empty. [I couldn't upload the image from either my laptop or the web, so here's a link to a screenshot of cmd.exe:] http://img546.imageshack.us/img546/6092/cmdafter5arrayisempty.jpg

I am obviously a beginner, so any suggestions on how to make this code more succinct, or how to better accomplish my scraping prototype, would be appreciated. All constructive criticism is welcome as well :-P


Solution

  • You might want to add some print statements to at least one of those foreach loops. Several things could be going on here. The two most likely are:

    • The filter may only be matching five items.
    • The HTML parser may be choking on some markup. In this case, it will attempt to load as much of the HTML DOM as it can.

    By adding in some print statements, you might be able to see how many times it is iterating.
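One way to check both possibilities outside of QueryPath is sketched below. It is only a rough diagnostic: it uses PHP's built-in DOMDocument and DOMXPath instead of QueryPath so that it runs on its own, the helper name inspect_html is made up for illustration, and the sample markup stands in for one fetched article page. The XPath query is a hand-written approximation of the `.jbIntroText p` selector.

```php
<?php
// Hypothetical helper (not part of QueryPath): counts the nodes matching an
// XPath query in an HTML string and collects any libxml parse warnings.
function inspect_html($html, $xpathQuery)
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $warnings = array();
    foreach (libxml_get_errors() as $error) {
        $warnings[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Count how many nodes the query actually matches.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);

    return array('matches' => $nodes->length, 'warnings' => $warnings);
}

// Sample markup standing in for one article page (an assumption).
$sample = '<html><body><div class="jbIntroText"><p>First</p><p>Second</p></div></body></html>';

// Rough XPath equivalent of the CSS selector ".jbIntroText p".
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " jbIntroText ")]//p';

$report = inspect_html($sample, $query);
echo 'matches: ' . $report['matches'] . "\n";
foreach ($report['warnings'] as $warning) {
    echo 'parser warning: ' . $warning . "\n";
}
```

In the real script you would pass in the HTML of each article page, for example the result of file_get_contents($url). Zero matches on the empty entries would point at the selector, while a long list of warnings would point at the markup.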

    And as an aside, if you're trying to get the list of articles on your blog, reading the RSS or Atom feed might be easier (though I suppose it might not have all the info you need).
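If the blog does expose a feed, reading it could look roughly like the sketch below. It assumes a standard RSS 2.0 layout (items under channel, each with title, link and description) and parses an inline sample so that it runs on its own. The feed contents and post URLs are invented for illustration, and your blog's feed may carry different fields.

```php
<?php
// Sample RSS 2.0 document standing in for the real feed (contents are made up).
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>My Blog</title>
    <item>
      <title>First post</title>
      <link>http://myblog.com/blog/first-post.html</link>
      <description>Intro text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://myblog.com/blog/second-post.html</link>
      <description>Intro text of the second post.</description>
    </item>
  </channel>
</rss>
XML;

// Parse the feed with SimpleXML, which ships with PHP.
$feed = simplexml_load_string($rss);

// Copy the fields of interest from each <item> into a plain array.
$articles = array();
foreach ($feed->channel->item as $item) {
    $articles[] = array(
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
        'intro' => (string) $item->description,
    );
}

print_r($articles);
```

With a live feed you would swap simplexml_load_string($rss) for simplexml_load_file() pointed at the feed's URL. Whether the description element holds the same text as `.jbIntroText p` depends on how the blog generates its feed.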