
Python3 utf-8 decode issue


The following code runs fine with Python3 on my Windows machine and prints the character 'é':

data = b"\xc3\xa9"

print(data.decode('utf-8'))

However, running the same code in an Ubuntu-based Docker container results in:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

Is there anything that I have to install to enable utf-8 decoding?


Solution

  • The problem is with the print() call, not with the decode() method. If you look closely, the raised exception is a UnicodeEncodeError, not a UnicodeDecodeError.

    Whenever you use the print() function, Python converts its arguments to str and writes the result to sys.stdout, which encodes it to bytes and sends them to the terminal (or whatever Python is running in). The codec used for encoding (e.g. UTF-8 or ASCII) depends on the environment. In an ideal case,

    • the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
    • the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which can encode every Unicode character).
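Both conditions can be reproduced without a real terminal. The following is a minimal sketch, assuming that an io.TextIOWrapper around an in-memory buffer is a fair stand-in for sys.stdout; the helper name encode_like_print is made up for this illustration.

```python
import io


def encode_like_print(text, encoding):
    """Send text through print() into a stream that uses the given codec
    and return the bytes that a terminal would receive."""
    raw = io.BytesIO()
    # newline="\n" keeps the output identical on Windows and Linux
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


text = b"\xc3\xa9".decode("utf-8")  # decoding itself works everywhere

# A codec that covers the character: encoding succeeds.
utf8_bytes = encode_like_print(text, "utf-8")

# A codec that does not cover it: the same failure as in the question.
try:
    encode_like_print(text, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# A terminal that interprets the UTF-8 bytes as Latin-1 shows mojibake.
mojibake = b"\xc3\xa9".decode("latin-1")

print(utf8_bytes)
print(type(ascii_error).__name__)
print(ascii(mojibake))
```

The failure happens while print() writes to the stream, which is why the traceback names an encode error even though the script only calls decode() explicitly.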

    In your case, the second condition isn't met in the Linux Docker container you mention: the encoding used there is ASCII, which only supports characters found on an old English typewriter. These are a few options to address this problem:

    • Set environment variables: on Linux, Python's encoding defaults depend on these (at least partially). In my experience, this is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script of the shell your terminal runs, e.g. .bashrc.
    • Re-encode STDOUT, like so:

      import sys
      sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
      

      The encoding used has to match the one the terminal expects.

    • Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one the terminal expects.
    • Avoid print() altogether: use open(fn, encoding=...) for output and the logging module for progress info. Depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when the logging module writes to STDERR).
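The effect of the environment on print() can be checked from a small driver script. This is only a sketch under one assumption: it uses PYTHONIOENCODING, a CPython environment variable that overrides the encoding of the standard streams, instead of the LC_ALL variable named above, because its behaviour does not depend on which locales are installed. The names CHILD_CODE and run_child are made up for this illustration.

```python
import os
import subprocess
import sys

# The two statements from the question, run in a child interpreter.
CHILD_CODE = r'print(b"\xc3\xa9".decode("utf-8"))'


def run_child(io_encoding):
    """Run the snippet in a fresh interpreter whose stdout uses io_encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,  # requires Python 3.7 or newer
    )


ascii_run = run_child("ascii")  # mimics the container from the question
utf8_run = run_child("utf-8")   # mimics a correctly configured environment

print(ascii_run.returncode, b"UnicodeEncodeError" in ascii_run.stderr)
print(utf8_run.returncode, utf8_run.stdout.strip())
```

In a container, the same variable (or a UTF-8 locale via LANG or LC_ALL) would usually be set in the image or at start-up rather than from Python; which locale names are available depends on the base image.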

    There might be other options, but I doubt that there are nicer ones.
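One such option, assuming Python 3.7 or newer, is the reconfigure() method of text streams, which changes the encoding of an existing stream in place instead of replacing sys.stdout with a new file object. The sketch below demonstrates it on a stand-in stream so that it runs the same way everywhere; the helper name force_utf8 is made up for this illustration.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 in place.

    Returns False if the stream has no reconfigure() method, for example
    an io.StringIO or a stream on Python older than 3.7."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8")
    return True


# Stand-in for a sys.stdout that starts out ASCII-only, as in the container.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="ascii", newline="\n")

changed = force_utf8(fake_stdout)
print(b"\xc3\xa9".decode("utf-8"), file=fake_stdout)
fake_stdout.flush()
written = raw.getvalue()

print(changed, fake_stdout.encoding, written)
```

In a real script the call would be force_utf8(sys.stdout) near the top, before anything is printed. As with the other options, the terminal still has to expect UTF-8 for the character to be displayed correctly.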