Tags: python, unicode, encoding, ascii, python-2.x

Why does Python print unicode characters when the default encoding is ASCII?


From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.


Solution

  • Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

    When trying to print a Unicode string, u'\xe9', Python implicitly attempts to encode that string using the scheme currently stored in sys.stdout.encoding. Python actually picks up this setting from the environment it was started from. If it can't find a proper encoding from the environment, only then does it revert to its default, ASCII.
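
    In other words, print u'\xe9' roughly boils down to the following (a hand-written sketch of that implicit step; the real machinery lives in C, but the effect is about this):

    >>> import sys
    >>> s = u'\xe9'
    >>> # roughly what `print s` does behind the scenes:
    >>> sys.stdout.write(s.encode(sys.stdout.encoding or 'ascii') + '\n')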

    For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

    $ python
    
    >>> import sys
    >>> print sys.stdout.encoding
    UTF-8
    

    Let's for a moment exit the Python shell and set bash's environment with some bogus encoding:

    $ export LC_CTYPE=klingon
    # we should get some error message here, just ignore it.
    

    Then start the Python shell again and verify that it does indeed revert to its default ASCII encoding.

    $ python
    
    >>> import sys
    >>> print sys.stdout.encoding
    ANSI_X3.4-1968
    

    Bingo!

    If you now try to output some Unicode character outside of ASCII, you should get a nice error message:

    >>> print u'\xe9'
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
    in position 0: ordinal not in range(128)
    
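    (Side note, not part of the original exchange: encoding explicitly sidesteps the implicit ASCII step entirely; Python then just ships the bytes you gave it, and whether they display correctly becomes purely the terminal's problem.)

    >>> print u'\xe9'.encode('utf-8')   # bytes '\xc3\xa9' go out as-is: no implicit encode, no error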

    Let's exit Python and discard the bash shell.

    We'll now observe what happens after Python outputs strings. For this we'll first start a bash shell within a graphic terminal (I'll use Gnome Terminal). We'll set the terminal to decode output with ISO-8859-1 aka Latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn't change the actual shell environment's encoding, it only changes the way the terminal itself will decode output it's given, a bit like a web browser does. You can therefore change the terminal's encoding, independently from the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):

    $ python
    
    >>> import sys
    
    >>> print sys.stdout.encoding
    UTF-8
    
    >>> print '\xe9' # (1)
    é
    >>> print u'\xe9' # (2)
    Ã©
    >>> print u'\xe9'.encode('latin-1') # (3)
    é
    >>>
    

    (1) Python outputs the binary string as is; the terminal receives it and tries to match its value against its Latin-1 character map. In Latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.

    (2) Python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance UTF-8. After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see later explanation). The terminal receives the stream as such and tries to decode 0xc3a9 using Latin-1, but Latin-1 covers only 0 to 255 and therefore decodes the stream one byte at a time. 0xc3a9 is 2 bytes long, so the Latin-1 decoder interprets it as two distinct bytes, 0xc3 (195) and 0xa9 (169), which yield the characters 'Ã' and '©' respectively.

    (3) Python encodes the Unicode code point u'\xe9' (233) with the Latin-1 scheme. It turns out the Latin-1 code point range is 0-255 and points to the exact same characters as Unicode does within that range. Therefore, Unicode code points between 0 and 255 yield the same value when encoded in Latin-1. So u'\xe9' (233) encoded in Latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it to its Latin-1 character map. Just like case (1), that yields "é", and that's what's displayed.
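
    You can actually re-enact the terminal's side of case (2) entirely inside Python, with no terminal involved, by decoding the UTF-8 bytes with Latin-1 yourself (a small illustrative check, not something the terminal literally runs):

    >>> u'\xe9'.encode('utf-8')                    # the bytes Python sends in case (2)
    '\xc3\xa9'
    >>> u'\xe9'.encode('utf-8').decode('latin-1')  # how a Latin-1 terminal reads those bytes
    u'\xc3\xa9'
    >>> # 0xc3 and 0xa9 are the code points of 'Ã' and '©', hence the mojibake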

    Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:

    >>> print '\xe9' # (4)
    
    >>> print u'\xe9' # (5)
    é
    >>> print u'\xe9'.encode('latin-1') # (6)
    
    >>>
    

    (4) Python outputs the binary string as is. The terminal attempts to decode that stream with UTF-8, but a lone 0xe9 byte is not a valid UTF-8 sequence (see later explanation), so the terminal is unable to convert it to a Unicode code point. No code point found, no character printed.

    (5) Python attempts to implicitly encode the Unicode string with whatever scheme sys.stdout.encoding is currently set to (still UTF-8). The resulting binary string is '\xc3\xa9'. The terminal receives the stream and decodes 0xc3a9 with UTF-8 as well. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol "é". The terminal displays "é".

    (6) Python encodes the Unicode string with Latin-1, which yields a binary string with the same value, '\xe9'. For the terminal, this is pretty much the same as case (4).
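
    Both outcomes can again be re-enacted from within Python by playing the terminal's role with .decode('utf-8'); the only difference is that Python's decoder raises an error where the terminal silently gives up:

    >>> '\xc3\xa9'.decode('utf-8')   # case (5): a UTF-8 decoder recovers the original code point
    u'\xe9'
    >>> '\xe9'.decode('utf-8')       # cases (4) and (6): raises UnicodeDecodeError, a lone 0xe9 is not valid UTF-8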

    Conclusions:

    • Python outputs non-Unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
    • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
    • Python gets that setting from the shell's environment (see the note right after this list for a way to pin it explicitly).
    • The terminal displays output according to its own encoding settings.
    • The terminal's encoding is independent from the shell's.
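
    A practical note that goes slightly beyond the original answer: since Python 2.6 you can pin that setting yourself with the PYTHONIOENCODING environment variable, so the choice no longer depends on whatever locale the shell happens to advertise:

    $ PYTHONIOENCODING=utf-8 python

    >>> import sys
    >>> print sys.stdout.encoding   # now reflects PYTHONIOENCODING, not the shell's locale
    utf-8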

    More details specifically on Unicode, UTF-8 and Latin-1

    Unicode is fundamentally a character table, where some keys (code points) have been conventionally assigned to specific symbols. For example, by convention it's been decided that hexadecimal key 0xe9 (decimal 233) points to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do Latin-1 and Unicode from 0 to 255. That is, 0x41 (dec 65) points to 'A' in ASCII, Latin-1 and Unicode, 0xc8 points to 'È' in Latin-1 and Unicode, and 0xe9 points to 'é' in Latin-1 and Unicode.
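
    You can poke at that table straight from the interpreter; unichr() maps a code point to its character and ord() goes the other way (just a quick illustration):

    >>> unichr(0x41), unichr(0xe9)        # 65 -> 'A', 233 -> 'é'
    (u'A', u'\xe9')
    >>> ord(u'\xe9')
    233
    >>> import unicodedata
    >>> unicodedata.name(u'\xe9')
    'LATIN SMALL LETTER E WITH ACUTE'
    >>> u'\xe9'.encode('latin-1')         # Latin-1 mirrors Unicode in 0-255, same value
    '\xe9'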

    When it comes to storing and transmitting text, Unicode code points need an efficient byte representation. That's what encodings are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward approach would be to simply use a code point's value in the Unicode map as its stored value, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1-to-1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

    Most encoding schemes have shortcomings regarding space requirements: the most economical ones leave out many Unicode code points (ASCII, for example, only covers the first 128 and Latin-1 only the first 256), while encodings that try to be more comprehensive end up being wasteful, since they require more bytes than necessary even for "cheap" code points. UTF-16, for instance, uses a minimum of 2 bytes per code point, including those in the ASCII range that normally only require one (e.g. 'B', which is 66, still takes 2 bytes of storage in UTF-16). UTF-32 is even more wasteful, as it stores every code point in 4 bytes.
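
    The trade-off is easy to measure by comparing encoded lengths (the -le variants are used below only to leave out the byte-order mark that Python's plain utf-16/utf-32 codecs prepend):

    >>> len(u'B'.encode('utf-8')), len(u'B'.encode('utf-16-le')), len(u'B'.encode('utf-32-le'))
    (1, 2, 4)
    >>> len(u'\xe9'.encode('utf-8')), len(u'\xe9'.encode('utf-16-le')), len(u'\xe9'.encode('utf-32-le'))
    (2, 2, 4)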

    The UTF-8 scheme cleverly mitigates this dilemma: it is able to store code points using a variable number of bytes. As part of its encoding strategy, UTF-8 laces code points with flag bits that tell decoders how much space each one requires and where its boundaries are.

    UTF-8 encoding of Unicode code points in the ASCII range (0-127)
    0xxx xxxx  (in binary)
    
    • The x's show the actual space reserved to "store" the code point during encoding.
    • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
    • Upon encoding, UTF-8 doesn't change the value of Unicode code points in that specific range (i.e. Unicode 65 encoded in UTF-8 is also 65). Considering that ASCII is also compatible with Unicode in that range, it incidentally makes ASCII compatible with UTF-8 (for that range).

    E.g. The Unicode code point for 'B' is '0x42' (66 in decimal), or 0100 0010 in binary. As said previously it's the same in ASCII. Here's a description of its UTF-8 encoding:

    0xxx xxxx  <-- UTF-8 wrapper for Unicode code points in the range 0 - 127
    *100 0010  <-- Unicode code point 0x42
    0100 0010  <-- UTF-8 encoded (exactly the same)
    
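    A quick sanity check of that single-byte case from the interpreter (bin() exists from Python 2.6 onwards):

    >>> bin(ord(u'B'))            # code point 0x42 fits the 0xxx xxxx pattern
    '0b1000010'
    >>> u'B'.encode('utf-8')      # UTF-8 leaves it untouched: one byte, same value
    'B'
    >>> u'B'.encode('utf-8') == u'B'.encode('ascii')
    True
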
    UTF-8 wrappers for Unicode code points above 127 (beyond-ASCII)
    110x xxxx 10xx xxxx            <-- (from 128 to 2047)
    1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
    
    • The leading 110 flag bits tell the UTF-8 decoder that this is the start of a code point encoded over 2 bytes, whereas 1110 indicates 3 bytes, 11110 indicates 4 bytes, and so forth.
    • The leading 10 flag bits mark a continuation (inner) byte of a multi-byte sequence.
    • As seen previously, the x's mark the space where the Unicode code point value is stored during encoding.

    E.g. the Unicode code point for 'é' is 0xe9 (233).

    1110 1001    <-- 0xe9
    

    To encode this code point in UTF-8, it's determined that since its value is larger than 127 and less than 2048, it should be encoded with a 2-byte UTF-8 wrapper:

    110x xxxx 10xx xxxx   <-- 2-byte UTF-8 wrapper for Unicode 128-2047
    ***0 0011 **10 1001   <-- 0xe9
    1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
    C    3    A    9
    

    The Unicode code point 0xe9, after UTF-8 encoding, becomes the two bytes 0xc3 0xa9, which is exactly how the terminal receives it. If your terminal is set to decode strings using Latin-1, you'll see 'Ã©', because 0xc3 in Latin-1 points to 'Ã' and 0xa9 points to '©'.
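
    The bit-level walk-through above can be double-checked from the interpreter; here's a small hand-rolled sketch of the 2-byte case, purely for illustration (the real codec is of course implemented in C):

    >>> cp = ord(u'\xe9')                          # 233
    >>> first  = 0b11000000 | (cp >> 6)            # 110x xxxx <- top 5 payload bits
    >>> second = 0b10000000 | (cp & 0b00111111)    # 10xx xxxx <- low 6 payload bits
    >>> '%02x %02x' % (first, second)
    'c3 a9'
    >>> u'\xe9'.encode('utf-8')                    # the real codec agrees
    '\xc3\xa9'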