Tags: python, unicode, escaping, urllib2

Python gets the wrong encoding for UTF-8 characters?


I'm trying to fetch text with special characters from a website, and the string Python returns is full of "\x" escapes. However, the encoding seems to be wrong. For example, when fetching:

th = urllib2.urlopen('http://norse.ulver.com/dct/zoega/th.html')

the <h1> line of the webpage should contain the letter "Þ", which has UTF-8 bytes C3 9E and Unicode code point U+00DE according to http://www.fileformat.info/info/charset/UTF-8/list.htm

Instead, I get

'<h1>\xc3\x9e</h1>'

with the byte number split in two, so that when I write the line to a file and then open it with a Unicode encoding, I get "Ãž" instead of "Þ".

How can I force Python to encode such a character as \uC39E or \xde instead of \xc3\x9e?


Solution

  • That's the correct UTF-8 byte encoding of U+00DE and it takes two bytes to represent it (\xc3 and \x9e), but you need to decode it to Unicode to see the Unicode codepoint. In Python 3 ascii() will show non-ASCII code points as escape codes:

    >>> print(ascii(b'<h1>\xc3\x9e</h1>'.decode('utf8')))
    '<h1>\xde</h1>'
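
The relationship between the code point and its byte encodings can be checked directly. This is a minimal Python 3 sketch; the variable names are illustrative only and do not come from the question.

```python
# The letter thorn as a one-character Unicode string.
thorn = '\u00de'

# The code point is a single number, independent of any encoding.
code_point = ord(thorn)                 # 0xDE

# UTF-8 needs two bytes for code points from U+0080 through U+07FF.
utf8_bytes = thorn.encode('utf-8')      # b'\xc3\x9e'

# Latin-1 maps code points below U+0100 directly to one byte each.
latin1_bytes = thorn.encode('latin-1')  # b'\xde'

print(hex(code_point), utf8_bytes, latin1_bytes)
```

So C3 9E and DE describe the same character: the first is its UTF-8 byte sequence, the second its code point.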
    

    The above is a Unicode string showing the correct Unicode codepoint. Displaying it in Python 3:

    >>> b'<h1>\xc3\x9e</h1>'.decode('utf8')
    '<h1>Þ</h1>'
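
For the file-writing step in the question, one possible approach is to decode the fetched bytes once and then name the encoding explicitly when writing and reading. The sketch below assumes Python 3 and a page served as UTF-8; the `raw` bytes stand in for the real HTTP response body, and the temporary file path is hypothetical.

```python
import os
import tempfile

# Stand-in for the bytes returned by reading the HTTP response.
raw = b'<h1>\xc3\x9e</h1>'

# Decode once, at the boundary, using the encoding the page was served in
# (UTF-8 is assumed here).
text = raw.decode('utf-8')

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'th.html')

# Name the encoding explicitly when writing and when reading back,
# rather than relying on the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, encoding='utf-8') as f:
    round_trip = f.read()

print(ascii(round_trip))
```

If the file is written and read with the same encoding, the text should come back unchanged.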
    

    If you decode with the wrong encoding, you get different Unicode code points, in this case U+00C3 and U+017E. \xc3 is the escape form used in a Unicode string for code points below U+0100, whereas \u017e is the form used for code points below U+10000:

    >>> print(ascii(b'<h1>\xc3\x9e</h1>'.decode('cp1252')))
    '<h1>\xc3\u017e</h1>'
    >>> b'<h1>\xc3\x9e</h1>'.decode('cp1252')
    '<h1>Ãž</h1>'
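
A wrong decode of this kind can often be undone by reversing it: encode the garbled text back to bytes with the encoding that was wrongly applied, then decode those bytes as UTF-8. This is a sketch under the assumption that the text was mis-decoded as cp1252 exactly once and that nothing was lost along the way; the variable names are illustrative.

```python
# Text produced by decoding UTF-8 bytes as cp1252 by mistake
# (U+00C3 followed by U+017E instead of a single U+00DE).
garbled = '<h1>\u00c3\u017e</h1>'

# Undo the wrong decode to recover the original bytes...
original_bytes = garbled.encode('cp1252')

# ...then decode them with the encoding they were really in.
repaired = original_bytes.decode('utf-8')

print(ascii(repaired))
```

Decoding correctly in the first place remains the simpler fix; this only helps when the garbled text is all that is left.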
    

    Recommended reading: