Tags: http, curl, screen-scraping, lynx, phantomjs

How can I screen-scrape the HTML result of a non-trivial user scenario?


I want to be able to get the HTML for a page which, if I were doing it interactively in a browser, would involve multiple actions and page loads:

1. Go to the homepage.
2. Enter text into a login form and submit the form (POST).
3. The POST goes through various redirections and frameset usage.

Cookies are set and updated throughout this process.

In a browser, after submitting, I simply end up on the final page.

But to do this with curl (in PHP or whatever), wget, or some other low-level tool, managing the cookies, redirections, and framesets becomes quite a chore, and it binds my script very tightly to the website, making it susceptible to even small changes in the site I'm scraping.
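To make the pain concrete, here is roughly what the raw-curl approach looks like in PHP. The URL, form field names, and user-agent string are hypothetical placeholders; every one of them is something the target site can silently change:

```php
<?php
// Sketch of the raw-curl login flow. URL and field names are
// hypothetical; the real ones depend entirely on the target site.
$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL            => 'http://example.com/login',  // hypothetical
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'username' => 'me',        // hypothetical field names
        'password' => 'secret',
    )),
    CURLOPT_FOLLOWLOCATION => true,        // chase the redirect chain
    CURLOPT_COOKIEJAR      => $cookieJar,  // write cookies here...
    CURLOPT_COOKIEFILE     => $cookieJar,  // ...and send them back
    CURLOPT_RETURNTRANSFER => true,
    // pretend to be a real browser (see the aside below)
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1) Firefox/10.0',
));
$html = curl_exec($ch);
curl_close($ch);

// Framesets are NOT followed automatically: you would have to parse
// $html, pull out each <frame src="...">, and fetch those URLs yourself
// with the same cookie jar -- exactly the brittle, site-specific part.
?>
```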

Can anyone suggest a way to do this?

I've already looked at Crowbar, PhantomJS, and Lynx (with its cmd_log/cmd_script options), but chaining everything together to mimic exactly what I'd do in Firefox or Chrome is difficult.

(As an aside, it might even be useful, or necessary, for the target website to think this script is Firefox, Chrome, or some other "real" browser.)


Solution

  • One way to do this is to use Selenium RC. While it's usually used for testing, at its core it's just a browser remote-control service.

    Use this web site as a starting point: http://seleniumhq.org/projects/remote-control/
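As a sketch, here is what the login scenario might look like with the PEAR Testing_Selenium PHP client (the URLs, element locators, and credentials below are hypothetical placeholders; the Java and Python RC clients expose the same calls). The RC server launches and drives a real Firefox, so cookies, redirects, and framesets are all handled by the browser itself:

```php
<?php
// Sketch using the PEAR Testing_Selenium RC client; element locators
// and URLs are hypothetical and depend on the target site.
require_once 'Testing/Selenium.php';

// Connects to a Selenium RC server on localhost:4444 and drives Firefox.
$selenium = new Testing_Selenium('*firefox', 'http://example.com/');
$selenium->start();

$selenium->open('/');                      // 1. go to the homepage
$selenium->type('id=username', 'me');      // 2. fill in the login form
$selenium->type('id=password', 'secret');
$selenium->click('id=login-button');       //    submit (POST)
$selenium->waitForPageToLoad('30000');     // 3. browser follows redirects

// If the final page is a frameset, selectFrame() can switch into a
// frame before grabbing its source.
$html = $selenium->getHtmlSource();        // final rendered HTML

$selenium->stop();
?>
```

Because a genuine browser does the work, the site also sees a real Firefox user agent, which covers the aside about looking like a "real" browser.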