If I have a string that I want to use in byte form encoded as UTF-8, do I need to encode the variable as a byte variable? Or, since Python is by default encoded as UTF-8, will it just treat the string as UTF-8 byte form in certain contexts without explicit encoding?
For example, I'm working on a project where I have an array of dictionaries that map strings to strings. If I write this array to a file with json.dump and then read it with json.load, the strings are recovered just fine, and I get no error, despite never encoding. This indicates to me that if you're just using UTF-8, you don't actually need to convert to byte form. Am I wrong? If I'm right, is this bad practice nonetheless? Would my example be any different if I were just writing strings without the JSON?
Python has multiple defaults regarding encoding. In Python 3, the situation is as follows:
- Source code is decoded as UTF-8 by default. You can change this with a magic comment (# coding: latin-1) if you really have to. It only affects string literals (and variable names).
- The default for the encoding parameter of str.encode() and bytes.decode() is UTF-8 too.
- If you do IO with open(), then the default for encoding depends on the circumstances (OS, env variables, Python version, build). You can check its value with locale.getpreferredencoding(). This default is also used when you read from sys.stdin or use print().

So I'd say it's okay to rely on the defaults for the first two cases (it's officially recommended for the first one).
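As a quick check of the second and third defaults, here is a minimal sketch. The sample string is my own, and the value printed at the end will differ from machine to machine, which is exactly the point.

```python
import locale

# Sample text with one non-ASCII character (an accented e), chosen for illustration.
text = "caf\u00e9"

# str.encode() and bytes.decode() fall back to UTF-8 when no encoding is given.
encoded = text.encode()
assert encoded == text.encode("utf-8")
assert encoded.decode() == text

# The accented character takes two bytes in UTF-8.
print(encoded)

# The IO default is environment-dependent, so inspect it instead of assuming.
print("preferred encoding:", locale.getpreferredencoding())
```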
But the third one is tricky: the IO default is UTF-8 on many systems, so you might think that with open(path) as f: will always use UTF-8, because it did so during development. Then you port the script to a different server and suddenly it raises UnicodeErrors or produces gibberish.
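To make both failure modes concrete, the sketch below simulates a mismatched default by passing the "wrong" encoding explicitly. Latin-1 stands in for whatever a differently configured server might pick; the file name and sample text are made up for the demo.

```python
import os
import tempfile

# Sample text with two non-ASCII characters (i with diaeresis, accented e).
text = "na\u00efve caf\u00e9"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # Case 1: written as UTF-8, read back as Latin-1.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="latin-1") as f:
        garbled = f.read()  # no exception, but the characters are wrong

    # Case 2: written as Latin-1, read back as UTF-8.
    with open(path, "w", encoding="latin-1") as f:
        f.write(text)
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

print(garbled != text)  # True: silent gibberish
print(decode_failed)    # True: the bytes are not valid UTF-8
```

The first case is the nastier one, because nothing fails until someone looks at the output.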
It's often not necessary to deal with encoded strings (i.e. bytes objects) for processing text.
Rather, you make sure the text is decoded when reading and encoded when writing or sending it.
This is done automatically for streams created with open() (unless you specify binary mode, 'rb'/'wb').
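The difference between text mode and binary mode can be sketched like this. The file name and sample string are assumptions for illustration only.

```python
import os
import tempfile

# Sample text: "gr", u with umlaut, sharp s.
text = "gr\u00fc\u00df"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.txt")  # hypothetical file name

    # Text mode: you pass a str, and open() encodes it on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: no decoding happens, you get the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: the bytes are decoded back into a str for you.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print(type(raw).__name__, type(decoded).__name__)
```

In other words, your own code only ever touches str unless you deliberately ask for bytes.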
If you think input/output has to be UTF-8, then you should explicitly specify encoding='utf8' when calling open().
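Applied to the JSON case from the question, that might look roughly like the following. The records and file name are invented for the example, and ensure_ascii=False is my own choice so that the file really contains non-ASCII characters.

```python
import json
import os
import tempfile

# Made-up data in the same shape as the question: a list of str-to-str dicts.
records = [{"name": "Zo\u00eb", "city": "K\u00f6ln"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.json")  # hypothetical file name

    # Name the encoding on both sides so the result does not depend on the machine.
    with open(path, "w", encoding="utf8") as f:
        json.dump(records, f, ensure_ascii=False)

    with open(path, encoding="utf8") as f:
        loaded = json.load(f)

print(loaded == records)
```

Note that json.dump() uses ensure_ascii=True by default, which escapes every non-ASCII character as \uXXXX. The file is then plain ASCII, and that is probably why the round trip in the question works without any explicit encoding. Writing strings directly with f.write() has no such safety net.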