Tags: python, unicode, character-encoding, python-2.x, string-literals

python u'\u00b0' returns u'\xb0'. Why?


I use python 2.7.10.

While working with character encodings, and after reading a lot on Stack Overflow and elsewhere, I ran into behaviour that looks strange to me. Entering the following in the Python interpreter

>>> u'\u00b0'

results in the following output:

u'\xb0'

I could reproduce this behaviour in a DOS window, the IDLE console, and the Wing IDE Python shell.

My assumptions (correct me if I am wrong): the degree symbol has the Unicode code point U+00B0, the UTF-8 encoding 0xC2 0xB0, and the Latin-1 encoding 0xB0. The Python docs say that a string literal with a u prefix is a Unicode string.

Question: Why is the result displayed as a Unicode string literal with a byte-style escape sequence that matches the Latin-1 encoding, instead of keeping the Unicode escape sequence?

Thanks in advance for any help.


Solution

  • The interactive interpreter echoes the repr of the result, and Python follows fixed rules when deciding what repr outputs for each character. For Unicode code points in the range 0x0080 to 0x00ff, the rule is to use the escape \xdd, where dd is the two-digit hex value, at least in Python 2. There's no way to change this. In Python 3, all printable characters are displayed as-is, without being converted to an escape sequence.

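    As for why it looks like the Latin-1 encoding: Unicode started with Latin-1 as its base, so all code points up to 0xff match their Latin-1 counterparts. In a Unicode literal, \xb0 denotes the code point U+00B0 rather than a byte, so u'\xb0' and u'\u00b0' are the same string.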
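
    The points above can be checked with a short sketch. It is written to run on both Python 2.7 and Python 3 (the u prefix is accepted again from Python 3.3 on); the variable names are only illustrative and do not come from the question.

```python
# Two spellings of the same one-character Unicode string.
degree_unicode_escape = u'\u00b0'
degree_hex_escape = u'\xb0'

# In a Unicode literal, \xb0 names a code point, not a byte,
# so both literals compare equal.
same_string = (degree_unicode_escape == degree_hex_escape)
code_point = ord(degree_unicode_escape)

# The encodings differ: Latin-1 uses one byte, UTF-8 uses two.
latin1_bytes = degree_unicode_escape.encode('latin-1')
utf8_bytes = degree_unicode_escape.encode('utf-8')

# Decoding every byte value 0x00..0xff as Latin-1 yields the
# code points 0..255 in order, which is why the escape "looks like" Latin-1.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode('latin-1')
latin1_matches_unicode = ([ord(ch) for ch in decoded] == list(range(256)))

print(same_string)             # True
print(hex(code_point))         # 0xb0
print(len(latin1_bytes))       # 1
print(len(utf8_bytes))         # 2
print(latin1_matches_unicode)  # True
```

    The repr itself is the only thing that differs between versions: Python 2 shows the \xb0 escape, while Python 3 shows the degree sign directly.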