Tags: python-3.x, unicode, utf-8, cp1252

Why can't I decode \xDF (ß) as UTF-8?


I have a bytestring b"\xDF". When I try to decode it as UTF-8, a UnicodeDecodeError is raised. Decoding it as CP1252 works fine. In both charsets, 0xDF is represented by the character "ß". So why the error?

>>> hex(ord("ß"))
'0xdf'
>>> b"\xDF".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 0: unexpected end of data
>>> b"\xDF".decode("cp1252")
'ß'

Solution

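  • hex(ord("ß")) returns the Unicode code point of ß (U+00DF), not its UTF-8 encoding. CP1252 happens to store that code point as the single byte 0xDF, but UTF-8 does not: all single-byte encoded characters in UTF-8 have to be in the range [0x00 .. 0x7F] (https://en.wikipedia.org/wiki/UTF-8), which is equivalent to 7-bit ASCII. A lone 0xDF is the lead byte of a two-byte sequence, so the decoder reports "unexpected end of data" when no continuation byte follows.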

    For the German ß, you get two bytes in UTF-8:

    >>> "ß".encode("utf-8")
    b'\xc3\x9f'

    Those two bytes decode back to the original character:

    >>> b'\xc3\x9f'.decode("utf-8")
    'ß'
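
To see where b'\xc3\x9f' comes from, here is a minimal sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx). The helper utf8_encode_2byte is hypothetical and written only for illustration; it is not part of the standard library, and in real code str.encode does this for you.

```python
def utf8_encode_2byte(code_point: int) -> bytes:
    """Illustrative helper (not a stdlib function): encode a code point
    in the range U+0080..U+07FF as its two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF are encoded as two bytes")
    lead = 0b11000000 | (code_point >> 6)                   # 110xxxxx
    continuation = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx
    return bytes([lead, continuation])


code_point = ord("ß")                      # 0xDF, the Unicode code point
encoded = utf8_encode_2byte(code_point)
print(encoded)                             # b'\xc3\x9f'

# The raw byte 0xDF itself matches the lead-byte pattern 110xxxxx,
# so a UTF-8 decoder expects one continuation byte after it.
is_two_byte_lead = (0xDF & 0b11100000) == 0b11000000
print(is_two_byte_lead)                    # True
```

This is why b"\xDF" on its own fails with "unexpected end of data": the decoder sees the start of a two-byte sequence and then runs out of input.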