Tags: python, unicode, utf-8, latin1

latin-1 vs unicode in python


I was reading this highly rated SO post on Unicode.

Here is an illustration given there:

$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

and the explanation given there was:

(1) Python outputs the binary string as is; the terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 (233) yields the character "é", so that's what the terminal displays.
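
(As a quick check in a Python 2 session, not part of the original post, byte 0xE9 does indeed map to U+00E9 in Latin-1:)

>>> '\xe9'.decode('latin-1')
u'\xe9'
>>> unichr(0xe9)
u'\xe9'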

My question is: why does the terminal match against the latin-1 character map when the encoding is 'UTF-8'?

Also, when I tried:

>>> print '\xe9'
?
>>> print u'\xe9'
é

I get a different result for the first one than what is described above. Why is there this discrepancy, and where does latin-1 come into play in this picture?


Solution

  • You are missing some important context: in that post, the OP had configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1, but left the shell locale variables set to UTF-8. The shell therefore tells Python to use UTF-8 for Unicode output, while the terminal itself actually expects Latin-1 bytes.
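
    (A quick way to see what the shell told Python, just a sketch: the locale environment variables are what Python consults here, and the values shown assume a typical en_US.UTF-8 setup, so they will differ on other machines.)

    >>> import os, locale, sys
    >>> os.environ.get('LC_ALL'), os.environ.get('LANG')
    (None, 'en_US.UTF-8')
    >>> locale.getpreferredencoding()
    'UTF-8'
    >>> sys.stdout.encoding
    'UTF-8'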

    The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.

    When a terminal is set to UTF-8, the \xe9 byte on its own is not valid UTF-8 (it is an incomplete sequence), so your terminal usually prints a question mark instead:

    >>> import sys
    >>> sys.stdout.encoding
    'UTF-8'
    >>> print '\xe9'
    ?
    >>> print u'\xe9'
    é
    >>> print u'\xe9'.encode('utf8')
    é
    

    If you instruct Python to replace bytes it cannot decode (the 'replace' error handler), it gives you the U+FFFD REPLACEMENT CHARACTER glyph instead:

    >>> '\xe9'.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
    >>> '\xe9'.decode('utf8', 'replace')
    u'\ufffd'
    >>> print '\xe9'.decode('utf8', 'replace')
    �
    

    That's because in UTF-8, \xe9 is the start byte of a 3-byte sequence, covering the Unicode codepoints U+9000 through U+9FFF, and printed as just a single byte it is invalid. This works:

    >>> print '\xe9\x80\x80'
    退
    

    because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Unified Ideograph.
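
    As a rough cross-check of that byte layout (a sketch, assuming a Python 2 interpreter):

    >>> bin(0xe9)                 # 1110xxxx marks the start of a 3-byte sequence
    '0b11101001'
    >>> u'\u9000'.encode('utf8')  # U+9000 encodes to exactly those three bytes
    '\xe9\x80\x80'
    >>> '\xe9\x80\x80'.decode('utf8')
    u'\u9000'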

    If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read: