I'm using CasperJS for web scraping, but I ran into a problem scraping the page described below.
The HTML of the page looks like this:
<img id="trigger">
<img id="cur_img_xxx" class="show">
<img id="cur_img_yyy" class="cache">
All <img> elements share the same dimensions, and "#trigger" is on the topmost layer. When an image has the .show class, it is displayed on the page; when it has the .cache class, it is downloaded but hidden. This way, when the user clicks on the image (which is actually the trigger), the next image is shown and a new image is downloaded via AJAX. The resulting HTML becomes:
<img id="trigger">
<img id="cur_img_xxx" class="cache">
<img id="cur_img_yyy" class="show">
<img id="cur_img_zzz" class="cache">
I guess it's a good strategy for improving the UX, and a good way to deter web scraping, but I still want to scrape :P
I tried $("#trigger").click() in the web console, and the images were navigated and downloaded correctly. However, when I tried to simulate this process using CasperJS, neither the navigation nor the image download worked. Here is the code:
var casper = require("casper").create({
    clientScripts: [
        'include/jquery.js'
    ],
    pageSettings: {
        loadImages: false,   // this won't affect since this will only forbid
        loadPlugins: false   // inline imgs from loading, but all imgs in this
    },                       // page are loaded dynamically
    verbose: true
});

casper.start("http://www.example.com/1234.html");

casper.then(function () {
    console.log("Connected! Current Url = " + this.getCurrentUrl());
});

casper.then(function () {
    // findInitialImgs will find imgs that have already been loaded
    imgs = this.evaluate(findInitialImgs);
    this.waitForSelector("#image_trigger").thenClick("#image_trigger");
    var next = this.evaluate(function () {
        return $("img[id^='cur_img_']").last().attr("href");
    });
    console.log(next);
});

casper.run(function () {
    this.echo('End').exit();
});
In principle, after "#trigger" is clicked, the last entry should change from <img id="cur_img_yyy"> to <img id="cur_img_zzz">. However, next still held <img id="cur_img_yyy">. Did I do anything wrong?
It seems to be a jQuery problem. After I removed the jQuery injection and replaced $("img[id^='cur_img_']").last().attr("href") with
var imgs = document.querySelectorAll("img[id^='cur_img_']");
return imgs[imgs.length - 1].getAttribute("href");
everything works fine.
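The native-DOM lookup above can be wrapped in a small helper and passed to this.evaluate. This is only a sketch under assumptions: lastCurImgAttr and its optional doc parameter are hypothetical names introduced for illustration, and the fallback to the global document assumes the function runs inside the page context.

```javascript
// Hypothetical helper (not part of CasperJS): returns the requested attribute
// of the last <img> whose id starts with "cur_img_", or null if none exists.
// `doc` is optional and only there so the function can be exercised outside
// a browser; inside casper.evaluate it falls back to the page's `document`.
function lastCurImgAttr(attr, doc) {
    doc = doc || document;
    var imgs = doc.querySelectorAll("img[id^='cur_img_']");
    if (imgs.length === 0) {
        return null; // nothing matched, e.g. the page has not rendered yet
    }
    return imgs[imgs.length - 1].getAttribute(attr);
}
```

Inside a step it could be called as this.evaluate(lastCurImgAttr, "href"), with no jQuery involved. Since thenClick only schedules the click as a later step, it may be safer to read the value from a following casper.then step instead of right after the thenClick call.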
Then I found this answer very helpful: CasperJS click event having AJAX call
It confirmed that a page's original scripts break when you inject jQuery into a page that already uses $ as jQuery.
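If jQuery is still wanted in the scraper, one way to avoid clobbering the page's own copy might be to inject it only when the page does not already define it. The following is a rough sketch, not a verified fix: pageHasJQuery and injectJQueryIfMissing are hypothetical helper names, the check only looks at window.jQuery (a page that binds $ to some other library is not covered), and it assumes the clientScripts option has been removed from the create() call.

```javascript
// Runs in the page context via casper.evaluate: reports whether the page
// already defines a global jQuery function.
function pageHasJQuery() {
    return typeof window.jQuery === "function";
}

// Hypothetical helper: injects the local jQuery file only when the page has
// no jQuery of its own. Returns true if the file was injected, false otherwise.
// `casperInstance` is the Casper object; `path` is the local script path.
function injectJQueryIfMissing(casperInstance, path) {
    if (casperInstance.evaluate(pageHasJQuery)) {
        return false; // the page ships its own jQuery, so leave $ untouched
    }
    // casperInstance.page is the underlying PhantomJS WebPage object
    return casperInstance.page.injectJs(path);
}
```

It would be called from inside a step, for example injectJQueryIfMissing(this, 'include/jquery.js') within a casper.then callback, after the page has loaded.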