I'm trying to use python to navigate through a website that have auth forms on its landing page, rendered by ASP scripts.
But when I use python (with mechanize, requests, or urlibs) to get the HTML of that site, I always end up with a semi-blank HTML file, due to such ASP scripts.
Would anyone know any method that I can use to get the final (as displayed on a browser) version of an ASP site?
Your target page is a frameset
. There is nothing fancy going on from the server side that I can tell. When I use requests
or urllib
to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn't do anything. Basically, all there is here is a frameset
with a single frame
in it.
The frame
target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it's not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren't ASCII, which means you need to guess what charset to decode if you want to process it as Unicode. But if you're in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won't matter.
I suspect your real problem is just that you don't know how HTML frameset
s work. You're downloading the frameset
, not downloading the referenced frame
, and wondering why the resulting page looks like an empty frameset
.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5 they're finally gone. But as long as ancient websites are out there and need to be scraped, you need to know how they work.
This isn't a substitute for the full documentation, but here's the short version of what a web browser does with a frameset
: For each frame
tag, it follows the src
attribute, then it replaces the contents of the frame
tag with a #document
tag with no attributes, with the results of reading the src
URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn't affect you.
Meanwhile, if you're trying to learn web scraping, you really want to install your browser's "Web Developer Tools" (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from wget
). So, next time you can say "In Chrome's Inspect Page, I see a #document
under the frame
, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the frame
has no children".