I'm trying to fetch text with special characters from a website, and the string Python returns is full of "\x" escape codes. However, the encoding seems to be wrong. For example, when fetching:
th = urllib2.urlopen('http://norse.ulver.com/dct/zoega/th.html').read()
the <h1> line of the webpage should contain the letter "Þ", which has the UTF-8 byte sequence C3 9E and Unicode code point DE according to http://www.fileformat.info/info/charset/UTF-8/list.htm
Instead, I get
'<h1>\xc3\x9e</h1>'
with the byte sequence split in two, so that when writing the line to a file and then opening it with a Unicode encoding, I get "Ãž" instead of "Þ".
How can I force Python to encode such a character as \uC39E or \xde instead of \xc3\x9e?
That's the correct UTF-8 byte encoding of U+00DE; it takes two bytes (\xc3 and \x9e) to represent it, but you need to decode it back to Unicode to see the code point. In Python 3, ascii() will show non-ASCII code points as escape codes:
>>> print(ascii(b'<h1>\xc3\x9e</h1>'.decode('utf8')))
'<h1>\xde</h1>'
The above is a Unicode string showing the correct Unicode code point. Displaying it in Python 3:
>>> b'<h1>\xc3\x9e</h1>'.decode('utf8')
'<h1>Þ</h1>'
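To get the single-byte \xde form the question asks about, decode first and then re-encode with a single-byte codec such as Latin-1. A minimal sketch (the bytes literal stands in for the fetched page, and the filename is illustrative):

```python
import os
import tempfile

raw = b'<h1>\xc3\x9e</h1>'   # bytes as fetched from the page
text = raw.decode('utf8')     # a str containing U+00DE: '<h1>Þ</h1>'

# Latin-1 maps U+0000..U+00FF one-to-one onto single bytes, so U+00DE
# encodes as the single byte \xde instead of the UTF-8 pair \xc3\x9e.
assert text.encode('latin-1') == b'<h1>\xde</h1>'

# When writing to a file, name the encoding explicitly so the text
# round-trips instead of turning into mojibake.
path = os.path.join(tempfile.mkdtemp(), 'th.html')
with open(path, 'w', encoding='utf8') as f:
    f.write(text)
with open(path, encoding='utf8') as f:
    assert f.read() == '<h1>\u00de</h1>'
```

The key point is to work with decoded str objects internally and only pick a byte encoding at the edges (reading the page, writing the file).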
If you use the wrong encoding to decode, you get different Unicode code points, in this case U+00C3 and U+017E. \xc3 is the escape code shown in a Unicode string for code points below U+0100, whereas \u017e is the form used for code points below U+10000:
>>> print(ascii(b'<h1>\xc3\x9e</h1>'.decode('cp1252')))
'<h1>\xc3\u017e</h1>'
>>> b'<h1>\xc3\x9e</h1>'.decode('cp1252')
'<h1>Ãž</h1>'
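The mix-up is reversible in this case: each mojibake character maps back to the byte it was decoded from, so re-encoding with cp1252 recovers the original UTF-8 bytes, which can then be decoded correctly. A small sketch:

```python
raw = b'<h1>\xc3\x9e</h1>'   # UTF-8 bytes for '<h1>Þ</h1>'

# Decoding with the wrong codec yields mojibake: Ã (U+00C3) and ž (U+017E).
wrong = raw.decode('cp1252')
assert wrong == '<h1>\xc3\u017e</h1>'

# Here both mojibake characters encode back to their original bytes ...
assert wrong.encode('cp1252') == raw

# ... so decoding those bytes as UTF-8 recovers the intended text.
assert wrong.encode('cp1252').decode('utf8') == '<h1>\u00de</h1>'
```

(This round trip is not guaranteed for every byte sequence, since cp1252 leaves a few byte values unmapped, but it works for this example.)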
Recommended reading: