CasperJS loop through table and scrape data for JSON output


I'm trying to scrape some data from a website with CasperJS. The data is stored in a table, and I'd like to end up with a proper JSON file containing, for each company: its name, email address, website, and a description of its activity.

So far I've been able to open the page and get the data, but not precisely (the email and website are in the same cell). I then found how to select exactly each element I want, but in that case I don't get the whole table, only the first row...

Could somebody help me by telling me where to look or how to write a loop in my case? Please bear in mind that I'm not a professional developer; I'm teaching myself.

Here is my code:

var casper = require('casper').create();
var url = 'http://www.rent2016.fr/pages/exposants';
var fs = require('fs');
var length;

casper.start(url);

casper.then(function() {
    this.waitForSelector('table#myTable');
});

casper.then(function(){
    var info = this.evaluate(function(){
        var table_rows = document.querySelectorAll("tr"); //or better selector

        return Array.prototype.map.call(table_rows, function(tr){
            return {

                nom : document.querySelector(".td-width h3").textContent,
                description: document.querySelector(".td-width p").textContent,
                mail : document.querySelector("td span a").textContent,
                site : document.querySelector('td span a[href^="http"]').textContent,



            };
        });
    });

    fs.write('test_rent_stringify.json', JSON.stringify(info), 'w');
    this.echo(JSON.stringify(info, undefined, 4));

});


casper.run(function() {

});

Here there is no real loop: the JSON repeats the first row's information for every row. To get every row's information, you have to replace

nom : document.querySelector(".td-width h3").textContent,

by

 nom : tr.children[1].textContent,

but in that case you can't target the h3, the links, etc. precisely: you get the whole cell's text at once. So at the moment I can either:

  • loop through the rows and get the information, but in an unusable form

  • get only the first row's information, but well structured

Thanks in advance!


Solution

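  • To read the information inside each row, you have to call tr.querySelector rather than document.querySelector. document.querySelector always returns the first match in the whole page, which is why every entry repeated the first row.

    The following loop works fine with the page: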

    var table_rows = document.querySelectorAll("tbody tr"); //or better selector
    return Array.prototype.map.call(table_rows, function(tr) {
        return {
            nom: tr.querySelector(".td-width h3").textContent,
            description: tr.querySelector(".td-width p").textContent,
            mail: tr.querySelector('td span a[href^="mailto"]').textContent,
            site: tr.querySelector('td span a:not([href^="mailto"])').textContent
        };
    });
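
One thing the loop above does not cover: if a row lacks one of the elements (a header row, or a company with no website), tr.querySelector returns null and reading .textContent throws. Below is a minimal sketch of a null-safe variant. The helper names textOf, rowToRecord and rowsToRecords are hypothetical, not part of CasperJS, and the selectors are simply the ones from the answer, so they assume the page structure has not changed.

```javascript
// Return the trimmed text of the first match inside `root`,
// or null when nothing matches (instead of throwing on null.textContent).
function textOf(root, selector) {
    var el = root.querySelector(selector);
    return el ? el.textContent.trim() : null;
}

// Build one record from a single <tr>; every lookup is scoped to that row.
function rowToRecord(tr) {
    return {
        nom: textOf(tr, '.td-width h3'),
        description: textOf(tr, '.td-width p'),
        mail: textOf(tr, 'td span a[href^="mailto"]'),
        site: textOf(tr, 'td span a:not([href^="mailto"])')
    };
}

// Map every row to a record; accepts a NodeList or a plain array.
function rowsToRecords(rows) {
    return Array.prototype.map.call(rows, rowToRecord);
}
```

In the CasperJS script these helpers would have to be declared inside the function passed to this.evaluate, because that function runs in the page context and cannot see variables from the outer script. Its last line would then be something like return rowsToRecords(document.querySelectorAll("tbody tr"));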