Python urllib2.urlopen - HTML text is garbled - why?

Printing the HTML shows garbled text instead of what I expect, i.e. what the browser shows under "view source".

Why is that, and how can I fix it easily?

Thank you for your help.

I get the same behavior using mechanize, curl, etc.

import urllib2

start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html

Solution

I got the same garbled text using curl:

curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm

The response appears to be gzipped, so piping it through gunzip shows the correct HTML for me:

curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip

Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML
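For what it's worth, here is a rough Python counterpart of the gunzip pipe above. It is only a sketch: it assumes the body is gzip-compressed whenever it starts with the two gzip magic bytes (0x1f 0x8b), and maybe_gunzip is a made-up helper name, not something from urllib2.

```python
import gzip
import io

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Return body decompressed if it looks gzipped, otherwise unchanged.

    maybe_gunzip is a hypothetical helper used for illustration only.
    """
    if body[:2] == GZIP_MAGIC:
        # Wrap the raw bytes in a file-like object so GzipFile can read them.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

With the code from the question, this would be called as html = maybe_gunzip(response.read()). Sniffing the first bytes is a heuristic, so a plain body would pass through untouched.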

Edited by OP:

The revised code, after reading the above, is:

import urllib2
import gzip
import StringIO

start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()

# The body is gzip-compressed, so wrap the raw bytes in a file-like
# object and let GzipFile decompress them.
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()

html now holds the decompressed HTML (print it to see).
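The code above always decompresses, which would fail on a page that is not gzipped. A possible variation is to decompress only when the response says so. This is a sketch under the assumption that the server sets a "Content-Encoding: gzip" header (I have not verified that this particular server does); decode_body is a made-up name.

```python
import gzip
import io


def decode_body(body, content_encoding=None):
    """Decompress body only when the header value says it is gzip.

    decode_body is a hypothetical helper; content_encoding is the raw
    value of the Content-Encoding response header, or None if absent.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body

# Possible usage with the response object from the code above:
#   html = decode_body(response.read(),
#                      response.info().get("Content-Encoding"))
```

If the server compresses the body without sending that header, this check would miss it and the body would come back unchanged.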