Tags: javascript, node.js, screen-scraping, zombie.js

Node.js Scraping ASU Course


I'm pretty new to Node.js, so apologies in advance if I don't know what I'm talking about.

I'm trying to scrape some courses off ASU's course catalog (https://webapp4.asu.edu/catalog/) and have made numerous attempts using Zombie, Node.IO, and the built-in https module. In every case I've run into a redirect loop.

I'm wondering if it's because I'm not setting my headers properly?

Below is a sample of the code I used (the plain https version, not Zombie/Node.IO):

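var https = require('https');

var options = {
  host: 'webapp4.asu.edu',
  path: '/catalog',
  method: 'GET',
  headers: {
    // Cookies go to the server in the Cookie request header;
    // set-cookie is the header a server uses in its response.
    'Cookie': 'onlineCampusSelection=C'
  }
};

var req = https.request(options, function(res) {
  console.log("statusCode: ", res.statusCode);
  console.log("headers: ", res.headers);
  res.on('data', function(d) {
    process.stdout.write(d);
  });
});

// Report connection errors instead of letting them throw.
req.on('error', function(e) {
  console.error(e);
});

// The request is not sent until end() is called.
req.end();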

Just to clarify, I'm not having trouble scraping with Node.js in general. It's specifically ASU's course catalog that is giving me trouble.

Appreciate any ideas you guys could give me, thanks!

Update: My request successfully went through if I create a cookie with a JSESSIONID I got from Chrome/FF. Is there a way for me to request/create a JSESSIONID?


Solution

  • It looks like the server sets the JSESSIONID cookie and then redirects away, so you need to tell Node.js not to follow redirects if you want to grab the cookie. I don't know how to do this with the http or https packages, but the request package, which you can install via npm, lets you do it. Here's a sample that should get you started:

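    var request = require("request");

    var options = {
      url: "https://webapp4.asu.edu/catalog/",
      // The option name is case-sensitive: followRedirect, not followredirect.
      followRedirect: false
    };

    request.get(options, function(error, response, body) {
      if (error) {
        return console.error(error);
      }
      console.log(response.headers['set-cookie']);
    });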
    

    Output should look something like this:

    [ 'JSESSIONID=B43CC3BB09FFCDE07AE6B3B702717431.catalog1; Path=/catalog; Secure' ]
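
Once you have that array, the cookie still has to be sent back on later requests. Here is a minimal sketch of one way to do that, assuming the server only needs the name=value part of each entry. The helper names toCookieHeader and withSessionCookie are made up for illustration and are not part of request or Node.js:

```javascript
// Turn a response's set-cookie array into a value for the Cookie request header.
// Only the leading name=value pair of each entry is kept; attributes such as
// Path and Secure are instructions to the client and are not sent back.
function toCookieHeader(setCookie) {
  return (setCookie || [])
    .map(function (entry) { return entry.split(';')[0].trim(); })
    .filter(function (pair) { return pair.length > 0; })
    .join('; ');
}

// Hypothetical helper: build options for a follow-up request that carries
// the session cookie in its headers.
function withSessionCookie(url, setCookie) {
  return {
    url: url,
    headers: { Cookie: toCookieHeader(setCookie) }
  };
}

var sample = ['JSESSIONID=B43CC3BB09FFCDE07AE6B3B702717431.catalog1; Path=/catalog; Secure'];
console.log(toCookieHeader(sample));
// prints "JSESSIONID=B43CC3BB09FFCDE07AE6B3B702717431.catalog1"
```

The object returned by withSessionCookie can be passed to request.get in place of the options above, with response.headers['set-cookie'] from the first call as the second argument. I haven't tested this against the ASU site, so whether the catalog accepts the session this way is an assumption. The request package also has a jar option for keeping cookies between calls, which may be simpler if it works for you.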