I'm on Linux Mint 13 Xfce 32-bit (kernel 3.2.0-7) with Python 2.7.3. I'm simply trying to read the source code of a webpage served over HTTPS. Here's my little program:
#!/usr/bin/env python
import mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.set_handle_equiv(False)
browser.addheaders = [('User-Agent',
                       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'),
                      ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
                      ('Accept-Encoding', 'gzip, deflate, sdch'),
                      ('Accept-Language', 'en-US,en;q=0.8,ru;q=0.6'),
                      ('Cache-Control', 'max-age=0'),
                      ('Connection', 'keep-alive')]
html = browser.open('https://scholar.google.com/citations?view_op=search_authors')
print html.read()
But instead of the source code of the page, I see only something like this:
What's the problem, and how can I fix it? I need to use mechanize, since I will need to interact with this page later on.
Your code works for me, but I would remove the line
('Accept-Encoding', 'gzip, deflate, sdch'),
so that you don't have to reverse that encoding afterwards. To clarify: you are getting the content, but you expect it to be in clear text. You get clear text by not requesting gzipped content.
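If you do keep the Accept-Encoding header, the bytes you read back are gzip-compressed and have to be decompressed yourself. A minimal sketch with the stdlib gzip module (the compressed payload here is a stand-in I build locally, not a real mechanize response):

```python
import gzip
import io

# Stand-in for the raw bytes a gzip-encoded server response would
# contain -- in your program this is what browser.open(...).read()
# returns when 'Accept-Encoding: gzip' is sent.
page = b'<html><body>page source</body></html>'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(page)
raw = buf.getvalue()

# Reversing the encoding: wrap the compressed bytes in a file-like
# object and let GzipFile decompress them.
decoded = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
assert decoded == page
```

Dropping the header is simpler, though: the server then sends plain text and `html.read()` gives you the source directly.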