I am trying to scrape 43 blog posts from my blog and store them in an array, but when I print_r the array, only the first 5 entries have content (the rest are empty) instead of all 43. Why? And how can I get all 43? I run this script from cmd.exe (the command line) on WAMP.
<?php
require 'src/QueryPath/QueryPath.php';

// Load the four index pages that list the blog posts.
$qp1 = htmlqp('http://myblog.com/blog');
$qp2 = htmlqp('http://myblog.com/blog/Page-2.html');
$qp3 = htmlqp('http://myblog.com/blog/Page-3.html');
$qp4 = htmlqp('http://myblog.com/blog/Page-4.html');

// Collect the href of every post link found on each index page.
$links = array();
foreach ($qp1->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}
foreach ($qp2->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}
foreach ($qp3->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}
foreach ($qp4->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}
print_r($links);

// Fetch each post and keep the text of its intro paragraphs.
$content = array();
foreach ($links as $link) {
    $url = "http://myblog.com" . $link;
    $content[] = htmlqp($url)->find('.jbIntroText p')->text();
}
print_r($content);
?>
After the first 5 entries, all the values in the array are empty. I couldn't upload the image, so here is a link to a screenshot of cmd.exe: http://img546.imageshack.us/img546/6092/cmdafter5arrayisempty.jpg
I am obviously a beginner, so any suggestions on how to make this code more succinct, or how to better accomplish my scraping prototype, would be appreciated. All constructive criticism is welcome as well :-P
You might want to add some print statements to at least one of those foreach loops. Several things could be going on here. The two most likely are:
By adding in some print statements, you might be able to see how many times it is iterating.
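As a rough sketch of what that instrumentation could look like for the last loop in the question: the helper name `scrapeWithLogging` and its `$extract` callback are made up for illustration (they are not part of QueryPath). The callback stands in for the `htmlqp($url)->find('.jbIntroText p')->text()` call, so the logging can be tried without hitting the network.

```php
<?php
// Hypothetical helper (not part of QueryPath): calls $extract for each link
// and prints one line per iteration so empty results are easy to spot.
function scrapeWithLogging(array $links, callable $extract)
{
    $content = array();
    foreach ($links as $i => $link) {
        $url = 'http://myblog.com' . $link;
        $text = (string) $extract($url);
        // Iteration number, the URL actually requested, and how much text came back.
        printf("[%d] %s -> %d chars\n", $i, $url, strlen($text));
        $content[] = $text;
    }
    return $content;
}

// Possible usage with the call from the question (assumes QueryPath is loaded):
// $content = scrapeWithLogging($links, function ($url) {
//     return htmlqp($url)->find('.jbIntroText p')->text();
// });
```

If the log shows all 43 iterations but `0 chars` from some point on, the loop itself is fine and the thing to inspect is the URL being built or the selector on those pages.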
And as an aside, if you're trying to get the list of articles on your blog, reading the RSS or Atom feed might be easier (though I suppose it might not have all the info you need).
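For the feed route, something along these lines might work, assuming the blog exposes a standard RSS 2.0 feed (`channel` > `item` with `title`, `link` and `description`). The function name `parseRssItems` and the feed URL in the usage comment are placeholders, not something taken from your site.

```php
<?php
// Hypothetical helper: turns an RSS 2.0 document (as a string) into an
// array of title/link/summary rows using PHP's built-in SimpleXML.
function parseRssItems($xml)
{
    // Collect XML parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Bail out on invalid XML or on a document without RSS items.
    if ($feed === false || !isset($feed->channel->item)) {
        return array();
    }

    $items = array();
    foreach ($feed->channel->item as $item) {
        $items[] = array(
            'title'   => (string) $item->title,
            'link'    => (string) $item->link,
            // Descriptions often contain HTML; keep only the text.
            'summary' => trim(strip_tags((string) $item->description)),
        );
    }
    return $items;
}

// Possible usage (the feed URL is an assumption; check where your blog publishes it):
// $items = parseRssItems(file_get_contents('http://myblog.com/blog/feed'));
// print_r($items);
```

Keep in mind that many feeds only list the most recent posts, so this may return fewer than 43 entries.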