I would like to know if there is something like Scrapy for nodejs ?. if not what do you think of using the simple page download and parsing it using cheerio ? is there a better way.
I haven't seen such a strong solution for crawling / indexing whole websites like Scrapy in python, so personally I use Python Scrapy for crawling websites.
But for scraping data from pages there is casperjs in nodejs. It is a very cool solution. It also works for ajax websites, e.g. angular-js pages. Python Scrapy cannot parse ajax pages. So for scraping data for one or few pages I prefer to use CasperJs.
Cheerio is really faster than casperjs, but it doesn't work with ajax pages and it doesn't have such good structure of a code like casperjs. So I prefer casperjs even when you can use cheerio package.
Coffee-script example:
casper.start 'https://reports.something.com/login', ->
this.fill 'form',
username: params.username
password: params.password
, true
casper.thenOpen queryUrl, {method:'POST', data:queryData}, ->
this.click 'input'
casper.then ->
get = (number) =>
value = this.fetchText("tr[bgcolor= '#AFC5E4'] > td:nth-of-type(#{number})").trim()