I have a Docker container with centos7 as the base image, where python --version reports Python 2.7.5 and python3 --version reports Python 3.6.8.
My ENTRYPOINT is a short run.sh file:
#!/bin/bash
locale >> locale.txt
python3 /home/scripts/script.py >> output.txt
while :
do
    sleep 10000
done
script.py is:
#!/bin/python
path_to_file = '/home/file.json'
print('starting')
try:
    with open(path_to_file, "r") as file:
        data = file.read()
        print(data)
except Exception as e: print(e)
print('ending')
Finally, /home/file.json just contains:
"°"
After launching the container and entering it with docker exec -it container-name bash, I check output.txt and its contents are:
starting
'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
ending
Changing python3 to python in run.sh and redoing the process results in output.txt having:
starting
"°"
ending
In both cases, locale.txt has:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
However, running locale in my terminal while in the Docker container's bash session gives me:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Running python3 script.py in the bash session gives me:
starting
"°"
ending
What I'm finding in the documentation is that Python 2 reads from the system configuration and uses ASCII if the system does not have a character set configured. It looks like Python 3 behaves the same way when the encoding parameter is not given to the open function. I must be misunderstanding the documentation somehow?
What is the difference between Python 2 and Python 3 that causes Python 2 to succeed and Python 3 to fail when run in the Docker container's ENTRYPOINT script?
Python 2 simply does not attempt to handle the encoding at all. It naively reads byte by byte.
Python 3, by contrast, distinguishes between files opened in binary mode (which reads literally byte by byte, and returns a bytes object), and text mode (which attempts to use an encoding, fails if the file contains sequences which are not valid in that encoding, and returns a str if successful).
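A minimal sketch of that difference, assuming a temporary file (a hypothetical stand-in for /home/file.json) that holds a quoted degree sign encoded as UTF-8. The encoding="ascii" argument is passed explicitly here only to mimic what Python 3.6 ends up choosing on its own under the POSIX locale:

```python
import os
import tempfile

# Hypothetical stand-in for /home/file.json: "°" encoded as UTF-8.
raw = b'"\xc2\xb0"\n'

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Binary mode: no decoding happens, the result is a bytes object.
with open(path, "rb") as f:
    as_bytes = f.read()

# Text mode with ASCII: decoding fails on the first non-ASCII byte.
ascii_error = None
try:
    with open(path, "r", encoding="ascii") as f:
        f.read()
except UnicodeDecodeError as e:
    ascii_error = e

os.remove(path)

print(type(as_bytes).__name__, as_bytes)
print(ascii_error)
```

The exception reports byte 0xc2 at position 1, the same place as in the output.txt from the question, because position 0 is the opening double quote.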
You have not identified the encoding of your file; if it is a valid UTF-8 file (which is what I would generally recommend), use encoding="utf-8". On Windows, you might need a different encoding; but of course, you need to understand character encodings, and specify the correct one.
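As a sketch of what passing the encoding explicitly looks like, assuming the file really is UTF-8 (the helper name read_text and the temporary file are made up for illustration):

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical stand-in for /home/file.json.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'"\xc2\xb0"')
    demo_path = tmp.name

data = read_text(demo_path)
os.remove(demo_path)

# ascii() keeps the output printable even if stdout itself is ASCII-only.
print(ascii(data))
```

With the encoding spelled out, the result no longer depends on whether the script is started from the ENTRYPOINT or from an interactive shell.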
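To force Python 3 to use a specific encoding without changing the source file, set a UTF-8 locale in the environment the script runs in, because open() takes its default encoding from the locale (locale.getpreferredencoding(False)); e.g.
export LANG="en_US.UTF-8"
The environment variable PYTHONIOENCODING only overrides the encoding of stdin, stdout and stderr, so on its own it affects print() but not open(); e.g.
export PYTHONIOENCODING="utf-8"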
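To check which encodings the interpreter actually picks in a given environment, a small diagnostic along these lines could be run once from run.sh and once from the interactive shell (the function name describe_encodings is an assumption for illustration, not part of any library):

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derives from the current environment."""
    return {
        # Default for open() in text mode when no encoding= is given.
        "open() default": locale.getpreferredencoding(False),
        # Encoding used by print(); this is what PYTHONIOENCODING overrides.
        "stdout": sys.stdout.encoding,
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


info = describe_encodings()
for name, value in info.items():
    print(name, "=", value)
```

Under the POSIX locale shown in locale.txt, the first entry would typically be an ASCII variant on Python 3.6, while in the en_US.UTF-8 session it should be UTF-8.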
See also https://nedbatchelder.com/text/unipain.html and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you need help identifying the encoding of your text file, try https://tripleee.github.io/8bit/ (full disclosure: I am the author of this page) armed with a hex dump or Python 3's repr of the bytes in the file (though it won't help in the case of UTF-8, so maybe try that first; maybe search for U+00B0, which is the Unicode code point for the degree sign. Its UTF-8 representation is the two bytes b'\xc2\xb0', which is suggested by the error message you got).
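The byte values can be verified directly in Python 3; this short sketch only uses the built-in str.encode and bytes.decode, and the Latin-1 line is added purely for contrast:

```python
degree = "\u00b0"  # U+00B0 DEGREE SIGN

# UTF-8 needs two bytes for this code point.
utf8_bytes = degree.encode("utf-8")
print(utf8_bytes)

# In Latin-1 the same character is the single byte 0xb0.
latin1_bytes = degree.encode("latin-1")
print(latin1_bytes)

# Decoding the file's bytes as UTF-8 gives back the quoted degree sign.
file_bytes = b'"\xc2\xb0"'
decoded = file_bytes.decode("utf-8")
print(ascii(decoded))
```

The leading 0xc2 byte is the one named in the error message, which is why UTF-8 is the likely encoding of the file.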