I am writing a site scraper to grab some specific content from an AJAX site with no actual links, only clickable text. I've only been using JavaScript for about a week now and am using CasperJS as it will cut out a lot of work.
The problem I am finding is that I'm writing multiple functions that all do the same thing and differ only in which links they search for, depending on the page. So I have:
function getLinks() {
    var links = document.querySelectorAll('div.AjaxLink h3');
    return Array.prototype.map.call(links, function(link) {
        return link.innerText;
    });
}
It's run via:
casper.then(function() {
    var myLinks = this.evaluate(getLinks);
    /* ... link manipulation code ... */
});
This works fine. I obviously don't want to have half a dozen functions which simply have a different query string. So what I want to do is:
function getLinks(findText) {
    var links = document.querySelectorAll(findText);
    return Array.prototype.map.call(links, function(link) {
        return link.innerText;
    });
}
Then I am trying to run it via:
casper.then(function() {
    var myLinks = getLinks('div.AjaxLink h3');
    /* ... link manipulation code ... */
});
The findText variable is passed in properly but it appears the query selector always returns an empty NodeList.
What am I doing wrong? Is document an empty document created inside that function?
CasperJS is built on top of PhantomJS. PhantomJS has two contexts: the sandboxed page context, which is accessible through evaluate(), and the outer context, which has access to require and phantom. Strangely enough, both contexts have access to window and document, but document doesn't mean anything in the outer context, because the DOM is empty. That is why querySelectorAll() doesn't find an element. The page DOM can only be accessed through evaluate().
So you need to execute your function in casper.evaluate(). The additional argument for your function is passed to evaluate() itself, after the function reference, rather than in a direct call to your function:
function getLinks(findText) {
    ...
}
casper.then(function() {
    // The selector is passed as an extra argument to evaluate(), which forwards it to getLinks
    var myLinks = this.evaluate(getLinks, 'div.AjaxLink h3');
    ...
});
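Putting the pieces together, here is a minimal sketch of the reusable version. The function getLinks is the one from the question; collectLinks is a hypothetical wrapper name introduced only for illustration, and it assumes it is handed the casper instance (or `this` inside a casper.then() callback).

```javascript
// Runs in the page context when invoked through evaluate(),
// so `document` refers to the real page DOM there.
function getLinks(findText) {
    var links = document.querySelectorAll(findText);
    return Array.prototype.map.call(links, function (link) {
        return link.innerText;
    });
}

// Hypothetical helper (not part of CasperJS): forwards any selector
// to getLinks through evaluate(), so one function serves every page.
function collectLinks(casper, selector) {
    // Extra arguments after the function are handed to it inside the page.
    return casper.evaluate(getLinks, selector);
}
```

Inside a step this could be used as `var myLinks = collectLinks(this, 'div.AjaxLink h3');`, with a different selector string for each page instead of a separate function per page.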
There is also an important note at the bottom of the evaluate page:
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine. Closures, functions, DOM nodes, etc. will not work!
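To get a feel for that rule of thumb, the sketch below pushes a few values through a plain JSON round trip. This is only an approximation of what happens at the evaluate() boundary, not PhantomJS's actual implementation, and roundTrip is a hypothetical helper name.

```javascript
// Hypothetical helper: approximates the evaluate() boundary by
// serializing a value to JSON and parsing it back.
function roundTrip(value) {
    var text = JSON.stringify(value);
    // JSON.stringify returns undefined for values it cannot represent.
    return text === undefined ? undefined : JSON.parse(text);
}

// An array of strings (like the link texts) survives intact.
var texts = roundTrip(['Home', 'About']);

// A function stored on an object is silently dropped.
var withFn = roundTrip({ label: 'x', onClick: function () {} });

// A bare function does not survive at all.
var bareFn = roundTrip(function () {});
```

This is why getLinks returns link.innerText strings rather than the matched elements themselves: the strings can cross the boundary, the DOM nodes cannot.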