
Is it pjscrape that is slow, or is it PhantomJS? Alternative scraper?


I just wrote my first script for pjscrape, but I find that it runs terribly slowly. I'm new to both pjscrape and PhantomJS, so I don't know which one is the culprit.

I am loading the file from localhost, so the bottleneck is definitely not in the transfer.

My config.js script looks like this:

pjs.addSuite({
    url: 'http://localhost/file.html',
    scraper: function() {
        var people = $('table.person');
        var results = [];

        $.each(people, function() {
            var $this = $(this);
            results.push({
                firstName: $this.find('.firstName').text(),
                lastName: $this.find('.lastName').text(),
                age: $this.find('.age').text()
            });
        });

        return results;
    }
});

Then I run PhantomJS from the command line, following the pjscrape instructions:

~> phantomjs pjscrape.js config.js

When I run the same code (just the scraper function) in Chrome, it is instant. In PhantomJS/pjscrape, it takes a good 30 seconds.

Any clue what is causing the slowness?
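One way to narrow this down is to time the scraper body on its own. The following is only a minimal sketch: `timed` is a hypothetical helper, not part of pjscrape or PhantomJS. Assuming pjscrape evaluates the scraper inside the page, the helper would have to be defined within the scraper function itself.

```javascript
// Hypothetical helper (not a pjscrape or PhantomJS API): runs fn once and
// returns its result together with the elapsed wall-clock time.
function timed(fn) {
  var start = Date.now();
  var result = fn();
  return {
    result: result,
    elapsedMs: Date.now() - start
  };
}

// Example: time an arbitrary piece of work.
var measurement = timed(function () {
  var total = 0;
  for (var i = 0; i < 1000; i++) {
    total += i;
  }
  return total;
});
console.log("work took " + measurement.elapsedMs + " ms");
```

If the elapsed time reported for the jQuery calls is small, the 30 seconds are presumably being spent elsewhere, for example in starting PhantomJS or in waiting for the page to be considered loaded, rather than in the DOM queries.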

Is there a better way to do this kind of DOM screen scraping? Maybe a Node.js solution?


Solution

  • If Node.js is an option, might I introduce you to cheerio? It's a great library for consuming questionably formed HTML documents. It gives you a jQuery-like API for working with a DOM-like representation of the page you're scraping. Paired with the request module, it makes for a pretty easy environment for scraping HTML.

    Your example would end up looking something like this (error handling left as an exercise for the reader):

    var cheerio = require("cheerio"),
        request = require("request");
    
    request("http://localhost/file.html", function(err, res, data) {
      // Parse the fetched markup into a jQuery-like object.
      var $ = cheerio.load(data);
    
      var people = $('table.person');
      var results = [];
    
      // Iterate over the selection itself; cheerio does not provide
      // jQuery's static $.each.
      people.each(function() {
        var $this = $(this);
    
        results.push({ 
          firstName: $this.find('.firstName').text(),
          lastName: $this.find('.lastName').text(),
          age: $this.find('.age').text()
        });
      });
    
      do_something_with(results);
    });
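
As for the error handling left as an exercise above, here is one possible sketch. It assumes the `(err, res, data)` callback shape used in the example and that `res.statusCode` holds the HTTP status; `responseProblem` is a hypothetical helper name, not part of request or cheerio.

```javascript
// Hypothetical helper: returns an Error describing why the response should
// not be parsed, or null when it looks usable.
function responseProblem(err, res) {
  if (err) {
    return err; // network-level failure, e.g. connection refused
  }
  if (!res) {
    return new Error("No response received");
  }
  if (res.statusCode < 200 || res.statusCode >= 300) {
    return new Error("Unexpected HTTP status " + res.statusCode);
  }
  return null;
}
```

Inside the callback, this could be checked before calling `cheerio.load(data)`: if `responseProblem(err, res)` returns an error, log its `message` and return early instead of parsing.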