
Why is Python3 breaking on UTF-8 when Python2 does not, all else equal?


I have a Docker container with centos7 as the base image and both python --version = Python 2.7.5 and python3 --version = Python 3.6.8.

My ENTRYPOINT is a short run.sh file.

#!/bin/bash
locale >> locale.txt
python3 /home/scripts/script.py >> output.txt
while :
do
    sleep 10000
done

script.py is:

#!/bin/python
path_to_file = '/home/file.json'
print('starting')
try:
    with open(path_to_file, "r") as file:
        data = file.read()
        print(data)
except Exception as e: print(e)
print('ending')

Finally, /home/file.json just contains:

"°"

After launching the container and entering it with docker exec -it container-name bash, I check output.txt and its contents are:

starting
'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
ending

Changing python3 to python in run.sh and redoing the process results in output.txt having:

starting
"°"
ending

In both cases, locale.txt has:

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

However, running locale in my terminal while in the Docker container's bash session gives me:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Running python3 script.py in the bash session gives me:

starting
"°"
ending

What I'm finding in the documentation is that Python 2 reads from the system configuration and uses ASCII if the system does not have a character set configured. It looks like Python 3 behaves the same way when the encoding parameter is not given to the open function. I must be misunderstanding the documentation somehow?

What is the difference between Python 2 and Python 3 that causes Python 2 to succeed and Python 3 to fail when run in the Docker container's ENTRYPOINT script?


Solution

  • Python 2 simply does not attempt to handle the encoding at all. It naively reads byte by byte.

    Python 3, by contrast, distinguishes between files opened in binary mode (which reads literally byte by byte, and returns a bytes object), and text mode (which attempts to use an encoding, and fails if the file contains sequences which are not valid in that encoding, and returns a str if successful).
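    As a minimal sketch of that distinction in Python 3 (the temporary path is a hypothetical stand-in for /home/file.json, and the ASCII codec is passed explicitly here to mimic what the POSIX locale selects):

```python
import os
import tempfile

# The question's file content, a quoted degree sign, encoded as UTF-8.
raw = b'"\xc2\xb0"'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.json")  # hypothetical stand-in path
    with open(path, "wb") as f:
        f.write(raw)

    # Binary mode: no decoding is attempted, the result is a bytes object.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with the ASCII codec: decoding fails on the first non-ASCII byte.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = str(exc)

print(type(as_bytes).__name__, as_bytes)
print(ascii_error)
```

    The second line printed should be the same message that ended up in output.txt.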

    You have not identified the encoding of your file; if it is a valid UTF-8 file (which is what I would generally recommend), pass encoding="utf-8" to open(). On Windows, you might need a different encoding; either way, you need to know which encoding the file actually uses and specify that one.
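    A short sketch of the explicit-encoding approach, assuming the file really is UTF-8 (read_text_utf8 and the temporary path are hypothetical names used only for illustration):

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8, independent of the process locale."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.json")  # hypothetical stand-in path
    with open(path, "wb") as f:
        f.write(b'"\xc2\xb0"')  # the question's file content as UTF-8 bytes
    data = read_text_utf8(path)

# ascii() escapes non-ASCII characters, so this line is safe to print
# even when stdout itself can only encode ASCII.
print(ascii(data))
```

    Under a POSIX locale on Python 3.6, a plain print(data) may still raise UnicodeEncodeError, because stdout would also be using ASCII; reading and writing are configured separately.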

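    Without an encoding argument, open() in Python 3 uses the locale's preferred encoding, which is ASCII under the POSIX locale your ENTRYPOINT script reports. To change that default without editing the source file, set a UTF-8 locale in the environment the script runs in (your interactive session already uses en_US.UTF-8). Note that the environment variable PYTHONIOENCODING only affects stdin, stdout, and stderr, so it helps print(data) but not open(); e.g.

    export LANG="en_US.UTF-8"
    export PYTHONIOENCODING="utf-8"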
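    To check what a given environment actually gives Python 3, a small diagnostic along these lines can be run both from run.sh and from the interactive session (a sketch; the variable names are only illustrative):

```python
import locale
import sys

# Encoding that open() falls back to in text mode when no
# encoding argument is passed.
default_open_encoding = locale.getpreferredencoding(False)

# Encoding used when print() writes to standard output.
stdout_encoding = sys.stdout.encoding

print("open() default:", default_open_encoding)
print("stdout:", stdout_encoding)
```

    If the two environments report different values, that difference would explain why the same script behaves differently in each.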
    

    See also https://nedbatchelder.com/text/unipain.html and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

    If you need help identifying the encoding of your text file, try https://tripleee.github.io/8bit/ (full disclosure: I am the author of this page), armed with a hex dump or Python 3's repr of the bytes in the file. That page won't help if the file is UTF-8, so try decoding as UTF-8 first. U+00B0 is the Unicode code point for the degree sign; its UTF-8 representation is the two bytes b'\xc2\xb0', which matches the byte 0xc2 reported in the error message you got.
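    A rough sketch of that inspection in Python 3, using the bytes from the question in place of reading the real file in "rb" mode:

```python
import unicodedata

# Stand-in for open(path, "rb").read() on the question's file.
raw = b'"\xc2\xb0"'

# repr and a hex dump show the raw bytes without any decoding.
print(repr(raw))
hex_dump = raw.hex()
print(hex_dump)

# Try UTF-8 first; a UnicodeDecodeError here would mean the file
# uses some other encoding.
text = raw.decode("utf-8")
names = [unicodedata.name(ch) for ch in text]
print(names)

# The UTF-8 encoding of U+00B0, for comparison with the dump above.
degree_utf8 = "\u00b0".encode("utf-8")
print(degree_utf8)
```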