I am trying to get and parse a webpage that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:
url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
# but printing the first characters shows garbage -
# '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
# '<!DOCTYPE'
html_decoded = html.decode(encoding)
The last line gives me an exception:
File "C:/Users/....\WebGetter.py", line 16, in get_page
html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>
I tried looking at other related questions such as urllib2 read to Unicode and How to handle response encoding from urllib.request.urlopen(), but didn't find anything helpful about this.
Can someone please shed some light and guide me in this subject? Thanks!
0x1f 0x8b 0x08 is the magic number of a gzip stream: the server sent you compressed content, which is why the byte length looks right but the bytes themselves look like garbage. You need to decompress the body before you can decode it with the charset from the headers.
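A minimal sketch of the fix. The sample bytes here are a hypothetical stand-in for what response.read() returns (the real page is obviously much larger); the decompression and decoding steps are what matters:

```python
import gzip
import io

# Hypothetical stand-in for the raw bytes from response.read():
# a small windows-1255 page, gzip-compressed the way the server would send it.
page = u'<!DOCTYPE html><html>\u05e9\u05dc\u05d5\u05dd</html>'.encode('windows-1255')
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(page)
raw = buf.getvalue()

# The tell-tale sign from the question: gzip streams start with 0x1f 0x8b 0x08.
assert raw[:3] == b'\x1f\x8b\x08'

# First decompress, *then* decode with the charset from the response headers.
html = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
html_decoded = html.decode('windows-1255')
print(html_decoded[:9])  # prints "<!DOCTYPE"
```

Alternatively, zlib.decompress(raw, 16 + zlib.MAX_WBITS) decompresses a gzip stream in one call. You can also sidestep the problem entirely by telling the server not to compress, e.g. req.add_header('Accept-Encoding', 'identity') before calling urlopen, though decompressing yourself is usually the more robust option.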