Tags: python, urllib2, urllib

Decoding urllib.request response


I'm getting this response when I open this url:

from urllib.request import Request, urlopen

r = Request(r'http://airdates.tv/')
h = urlopen(r).readline()
print(h)

Response:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed\xbdkv\xdbH\x96.\xfa\xbbj\x14Q\xaeuJ\xce\xee4E\x82\xa4(9m\xe7\xd2\xd3VZ\xaf2e\xab2k\xf5\xc2\n'

What encoding is this? Is there a way to decode it based on the standard library?
Thank you in advance for any insight on this matter!

PS: It seems to be gzip.


Solution

  • It's gzip compressed HTML, as you suspected.

    Rather than use urllib, use requests, which will decompress the response for you:

    import requests
    
    r = requests.get('http://airdates.tv/')
    print(r.text)
    

    You can install it with pip install requests, and never look back.


    If you really must restrict yourself to the standard library, then decompress it with the gzip module:

    import gzip
    from urllib.request import urlopen  # Python 2: from urllib2 import urlopen

    f = urlopen('http://airdates.tv/')

    # how to determine the content encoding
    content_encoding = f.headers.get('Content-Encoding')
    #print(content_encoding)

    # read the body once; it stays as-is if it is not compressed
    response = f.read()

    # how to decompress gzip data with Python 3
    if content_encoding == 'gzip':
        response = gzip.decompress(response)

    # decompress with Python 2 (gzip.decompress does not exist there;
    # also needs: from cStringIO import StringIO)
    #if content_encoding == 'gzip':
    #    gz = gzip.GzipFile(fileobj=StringIO(response))
    #    response = gz.read()
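
The standard-library steps above still leave you with bytes, not text. As a minimal Python 3 sketch, the helper below gunzips the body when the header says gzip or when the data starts with the gzip magic bytes `\x1f\x8b` (the same two bytes at the start of the output in the question), and then decodes it. The name `decode_body` and the UTF-8 default are assumptions made for illustration, not part of the original answer.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the response body as text, gunzipping it first if needed."""
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    looks_gzipped = raw[:2] == b"\x1f\x8b"
    if content_encoding == "gzip" or looks_gzipped:
        raw = gzip.decompress(raw)
    # Assumption: the page is UTF-8 unless the caller passes another charset.
    return raw.decode(charset, errors="replace")


# Offline demonstration with a locally compressed payload (no network needed).
sample = gzip.compress(b"<html>hello</html>")
print(decode_body(sample, content_encoding="gzip"))
```

With a real response you would probably call it as `decode_body(f.read(), f.headers.get('Content-Encoding'), f.headers.get_content_charset() or 'utf-8')`, so that the charset declared by the server is used when there is one.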