
Screen Scraping a Fully Rendered Page


I'm trying to figure out how to capture a fully rendered page and manipulate it. I've been using Nokogiri, Hpricot, Mechanize, etc., but none can capture a page whose elements are rendered by AJAX or something else after the fact.

An example is Twitter's status page, one of many that I've come across for this project that I'm having trouble with:

http://twitter.com/#!/nytimes/status/42341419062525952

or

http://twitter.com/#!/alleyinsider/status/42337897038364672

If you look at the HTML source, it's mostly JavaScript; the actual content only shows up once the page has been rendered. In Firebug or another console you can see the fully rendered result, but I have no idea how to capture it with the aforementioned tools. Am I missing something?

BTW: Yes, I know there is a Twitter API. But this is more of a theoretical issue as I have hit this in varying degrees on a few other sites.

Thanks!


Solution

  • ...none can capture a page whose elements are rendered by AJAX or something else after the fact.

    That is correct. The content you seek doesn't exist in the document when it is captured; it's inserted when the browser processes the JavaScript, which requests the content via AJAX and adds it to the page.

    So, to get where you want to go, you'll need either a JavaScript interpreter or a browser under your code's control.

    The Watir project can do that. It's the next step up from Mechanize: instead of Ruby code fetching and parsing the page itself, your Ruby code tells a real browser what to do. The browser loads the page and processes the JavaScript, which in turn pulls in the content you're looking for.

    There are variations on Watir for the different browsers, so you can use IE, Safari, Firefox, etc.
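
As a rough illustration of the browser-driven approach, here is a minimal sketch. It assumes the `watir` gem and a matching browser driver (for Firefox, geckodriver) are installed. The helper names `rendered_html` and `fetch_rendered`, the CSS selector you pass in, and the 30-second timeout are all illustrative assumptions, not anything prescribed by Watir or by the sites in the question.

```ruby
# Return the page's HTML as the browser sees it AFTER its JavaScript has run.
#
# browser  - a Watir::Browser (or anything with the same goto/element/html methods)
# url      - the page to load
# selector - CSS selector of an element that only exists once the AJAX content
#            has been inserted (hypothetical; inspect the target page to pick one)
def rendered_html(browser, url, selector, timeout: 30)
  browser.goto(url)
  # Block until the dynamically inserted element shows up, or time out.
  browser.element(css: selector).wait_until(timeout: timeout, &:present?)
  # The DOM serialized by the browser, including the injected content.
  browser.html
end

# Convenience wrapper: start a browser, grab the rendered HTML, always close it.
def fetch_rendered(url, selector, browser_name: :firefox)
  require 'watir' # loaded lazily so the helper above has no hard dependency
  browser = Watir::Browser.new(browser_name)
  begin
    rendered_html(browser, url, selector)
  ensure
    browser.close
  end
end
```

The string returned by `fetch_rendered` can then be handed to the parser you already use, for example `Nokogiri::HTML(html)`, so the rest of the scraping code stays the same. Waiting on a specific element rather than sleeping for a fixed time is a design choice here: it returns as soon as the content is present and fails loudly if it never appears.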