Tags: javascript, node.js, asynchronous, web-scraping

Node.js: requesting a page and allowing the page to build before scraping


I've seen some answers to this that refer the asker to other libraries (like PhantomJS), but I'm here wondering if it is at all possible to do this in just Node.js?

Consider my code below. It requests a webpage using request, then uses cheerio to explore the DOM and scrape the page for data. It works flawlessly, and if everything had gone as planned, I believe it would have output a file exactly as I imagined it in my head.

The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am trying to find a way to "wait" for this data to load before I scrape it into my new file.

var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
    method: 'GET',
    url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
    if (err) return console.error(err);
    // Tell Cheerio to load the HTML
    var $ = cheerio.load(body);


    // Create an empty object to write to the file later
    var toSort = {};

    // Iterate over the DOM and fill the toSort object
    $('#emb table td.list_right').each(function() {
        var row = $(this).parent();     
        toSort[$(this).text()] = {
            [$("#lastdate").text()]: $(row).find(".idx1").html(),
            [$("#currdate").text()]: $(row).find(".idx2").html()            
        }
    });

    //Write/overwrite a new file
    var stream = fs.createWriteStream("/tmp/shipping.txt"); 
    var toWrite = "";

    stream.once('open', function(fd) {
        toWrite += "{\r\n";
        for (var i in toSort) {
            toWrite += "\t" + i + ": { \r\n";
            for (var j in toSort[i]) {
                toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
            }
            toWrite += "\t" + "}, \r\n";
        }
        toWrite += "}";

        stream.write(toWrite);
        stream.end();
    });
});

The expected result is a text file containing information formatted like a JSON object.

It should contain several entries that look something like this:

"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": { 
     "2016-09-29": 26.7,
     "2016-09-30": 26.8,
}, 

But since the name is the only thing that doesn't load asynchronously (the dates and values do), I get a messed-up object.

I actually tried just setting a setTimeout in various places in the code. The script will only be touched by developers who can afford to run it several times if it fails a few times. So while not ideal, even a setTimeout (of up to maybe 5 seconds) would be good enough.

It turns out the setTimeouts don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page "as is" when I receive it, and I'm in fact not looking at a live thing I can wait on while it loads its dynamic content.

I've considered investigating how to intercept the packets as they come in, but I don't understand HTTP well enough to know where to start.


Solution

  • The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this URL: http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp

    and their server returns the HTML, and in this HTML there are the JS and CSS imports. As far as your program is concerned, that is the end of it: you just have the HTML, and that's it. A browser, on the other hand, knows how to parse the HTML document, so it can understand the JavaScript referenced there and execute/run it, and this is exactly your problem: your program is not able to execute the JavaScript embedded in the HTML. You need to find or write a scraper that is able to run JavaScript. There is a similar question on Stack Overflow: Web-scraping JavaScript page with Python

    The answer there suggests https://github.com/niklasb/dryscrape, and it seems that this tool is able to run JavaScript. It is written in Python, though.
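    For completeness, the same idea is available without leaving Node: Puppeteer (https://github.com/puppeteer/puppeteer) drives a headless Chromium instance that does execute the page's JavaScript, so you can wait for the table to be populated before reading the HTML. A sketch under the assumption that Puppeteer is installed (`npm install puppeteer`) and you are on a Node version with async/await; the selector is taken from the question's code:

    ```javascript
    // Requires: npm install puppeteer
    async function scrapeRendered(url) {
        var puppeteer = require('puppeteer');
        var browser = await puppeteer.launch();
        var page = await browser.newPage();
        await page.goto(url);
        // Wait until the async-built table cells actually exist in the DOM
        await page.waitForSelector('#emb table td.list_right');
        // Grab the fully rendered HTML; feed this to cheerio as before
        var html = await page.content();
        await browser.close();
        return html;
    }

    // Usage:
    // scrapeRendered('http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp')
    //     .then(function(html) { /* cheerio.load(html) ... */ });
    ```

    The rest of the original script (cheerio, the toSort object, the file write) can then stay as it is, operating on the rendered HTML instead of the raw response body.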