Search code examples
pythonmechanizescraper

Using mechanize with a hidden log-in page


I want to write a scraper to pull pdfs from a database of police reports, but I've run into a snag. When I click the page's "Log In" button, it doesn't bring up a separate URL, it just loads the log-in page asynchronously. I'm not sure how it does this - I've watched the Net tab in my console but the page doesn't seem to be making any XHR requests.

I was planning to write my scraper in Python, so I'd like to use the mechanize library to log in and crawl through the pdfs. But before I can do any of that, I've got to find that pesky log-in page!


Solution

  • If the web page is doing a lot of Ajax type of activity, then you really can't just use an HTML parser - you also need a Javascript interpreter.

    You'll likely be forced to use something like Selenium or PhantomJS.