python, html, urllib2, mechanize-python

python urllib2.urlopen - html text is garbled - why?


The printed HTML comes out as garbled text instead of the markup I see under "View Source" in the browser.

Why is that, and is there an easy way to fix it?

Thank you for your help.

I get the same behavior with mechanize, curl, etc.

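import urllib2

start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()  # raw response body, exactly as the server sent it
print html              # Python 2 print statement; this is where the garbled text shows up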

Solution

  • I got the same garbled text using curl

    curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
    

    The result appears to be gzipped, so piping it through gunzip shows the correct HTML for me:

    curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
    

    Here's a solution for doing this in Python: Convert gzipped data fetched by urllib2 to HTML

    Edited by OP:

    The revised code, after reading the above, is:

    import urllib2
    import gzip
    import StringIO
    
    start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
    response = urllib2.urlopen(start_url)
    html = response.read()  # still gzip-compressed bytes at this point
    
    # GzipFile needs a file-like object, so wrap the bytes in a StringIO buffer
    data = StringIO.StringIO(html)
    gzipper = gzip.GzipFile(fileobj=data)
    html = gzipper.read()  # decompressed HTML
    

    html now holds the decompressed HTML (print it to see).
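
For readers on Python 3, where urllib2 and the StringIO module no longer exist under those names, the same idea might be sketched roughly as follows. This is not the code from the answer above: the helper name maybe_gunzip is hypothetical, and checking the two-byte gzip magic number is an assumption about how to detect compression when the server does not label the response. The demo compresses a small page locally instead of fetching the URL.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Return body decompressed if it looks gzipped, otherwise unchanged.

    Hypothetical helper for illustration; body is expected to be bytes.
    """
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    # Stand-in for a server response: compress a small page locally.
    page = b"<html><body>hello</body></html>"
    compressed = gzip.compress(page)

    print(maybe_gunzip(compressed) == page)  # gzipped input is decompressed
    print(maybe_gunzip(page) == page)        # plain input passes through
```

In real use, the bytes passed in would come from something like urllib.request.urlopen(start_url).read(), and the result would still need decoding to text with the page's character encoding.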