I just wrote my first script for pjscrape, but I find that it runs terribly slowly. I'm new to both pjscrape and PhantomJS, so I don't know which one is the culprit.
I am loading the file from localhost, so the bottleneck is definitely not in the transfer.
My config.js script looks like this:
pjs.addSuite({
    url: 'http://localhost/file.html',
    scraper: function() {
        // Runs in the page context, where jQuery is available as $
        var people = $('table.person');
        var results = [];
        $.each(people, function() {
            var $this = $(this);
            results.push({
                firstName: $this.find('.firstName').text(),
                lastName: $this.find('.lastName').text(),
                age: $this.find('.age').text()
            });
        });
        return results;
    }
});
Then I just execute PhantomJS using the command line instructions here.
~> phantomjs pjscrape.js config.js
When I run the same code (just the scraper function) in Chrome, it is instant. In PhantomJS/pjscrape, it takes a good 30 seconds.
Any clue what is causing the slowness?
Is there a better way to do this DOM screen scraping? Maybe a nodejs solution?
If Node.js is an option, might I introduce you to cheerio? It's a great library for consuming questionably-formed HTML documents. It gives you a jQuery-like API for working with a DOM-like representation of the page you're scraping. Paired with request, it makes for a pretty easy environment for scraping HTML.
Your example would end up looking something like this (error handling left as an exercise for the reader):
var cheerio = require("cheerio"),
    request = require("request");

request("http://localhost/file.html", function(err, res, data) {
    // Parse the response body into a jQuery-like document
    var $ = cheerio.load(data);
    var people = $('table.person');
    var results = [];
    // Iterate with the selection's own .each(), which cheerio supports
    people.each(function() {
        var $this = $(this);
        results.push({
            firstName: $this.find('.firstName').text(),
            lastName: $this.find('.lastName').text(),
            age: $this.find('.age').text()
        });
    });
    // Placeholder: replace with whatever you want to do with the data
    do_something_with(results);
});
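If you'd rather not leave the error handling as an exercise, here is one way it could look. This is only a sketch: `responseError` and `scrapePeople` are names I made up for illustration, and treating anything other than a 200 status as a failure is an assumption that may be too strict for your server.

```javascript
// Hypothetical helper: turn a failed request or non-200 response into an Error.
// Returns null when the response looks usable.
function responseError(err, res) {
    if (err) {
        return err; // network-level failure (connection refused, DNS, ...)
    }
    if (!res || res.statusCode !== 200) {
        return new Error("Unexpected HTTP status: " + (res ? res.statusCode : "none"));
    }
    return null;
}

// Hypothetical wrapper: scrape the person tables and report via callback(err, results).
function scrapePeople(url, callback) {
    // Required lazily so responseError can be used without these packages installed.
    var cheerio = require("cheerio");
    var request = require("request");

    request(url, function(err, res, body) {
        var problem = responseError(err, res);
        if (problem) {
            return callback(problem);
        }
        var $ = cheerio.load(body);
        var results = [];
        $("table.person").each(function() {
            var $this = $(this);
            results.push({
                firstName: $this.find(".firstName").text(),
                lastName: $this.find(".lastName").text(),
                age: $this.find(".age").text()
            });
        });
        callback(null, results);
    });
}
```

You would call it as `scrapePeople("http://localhost/file.html", function(err, results) { ... })` and check `err` before touching `results`, following the usual Node callback convention.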