Tags: python, unicode, utf-8, byte, file-writing

If I want to use UTF-8 encoding, the default for Python, do I have to encode my string variables to byte variables?


If I have a string that I want to use in byte form encoded as UTF-8, do I need to encode the variable as a byte variable? Or, since Python is by default encoded as UTF-8, will it just treat the string as UTF-8 byte form in certain contexts without explicit encoding?

For example, I'm working on a project where I have an array of dictionaries that map strings to strings. If I write this array to a file with json.dump and then read it with json.load, the strings are recovered just fine, and I get no error, despite never encoding. This indicates to me that if you're just using UTF-8, you don't actually need to convert to byte form. Am I wrong? If I'm right, is this bad practice nonetheless? Would my example be any different if I were just writing strings without the JSON?


Solution

  • Python has multiple defaults regarding encoding. In Python 3, the situation is as follows:

    • The source file encoding is UTF-8 by default. You can override this with a comment in one of the first two lines of the module (e.g. # coding: latin-1) if you really have to. This only affects how string literals (and variable names) in the source file are decoded.
    • The encoding parameter of str.encode() and bytes.decode() also defaults to UTF-8.
    • But when you open a file with open(), the default encoding depends on the circumstances (OS, environment variables, Python version, build). You can check its value with locale.getpreferredencoding(). This default is also used when you read from sys.stdin or use print().

    So I'd say it's okay to rely on the defaults for the first two cases (it's officially recommended for the first one). But the third one is tricky: the I/O default is UTF-8 on many systems, so you might think that with open(path) as f: will always use UTF-8 because it did so during development. Then you port the script to a different server, and suddenly it raises a UnicodeError or produces gibberish.
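To make the difference between these defaults concrete, here is a minimal sketch. The sample string and the choice of cp1252 and ascii as the "other server's" encodings are assumptions for illustration; the actual I/O default on your machine may differ.

```python
import locale

text = "café"  # sample string with one non-ASCII character (illustrative)

# str.encode() and bytes.decode() default to UTF-8.
data = text.encode()
assert data == text.encode("utf-8")
assert data.decode() == text

# The encoding open() falls back to when none is given (platform dependent).
io_default = locale.getpreferredencoding(False)
print("Default text I/O encoding:", io_default)

# Simulate a mismatch: bytes written as UTF-8 but decoded with another codec.
garbled = data.decode("cp1252")  # no error, but the text is wrong (mojibake)
print(garbled)

ascii_error = None
try:
    data.decode("ascii")  # a stricter codec refuses the bytes outright
except UnicodeDecodeError as exc:
    ascii_error = exc
print(type(ascii_error).__name__)
```

The two failure modes match the ones described above: a permissive single-byte codec silently yields gibberish, while a codec that cannot represent the bytes raises a UnicodeError subclass.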

    It's often not necessary to deal with encoded strings (i.e. bytes objects) when processing text. Rather, you make sure the text is decoded when reading and encoded when writing or sending it. This is done automatically for streams created with open() (unless you specify binary mode 'rb'/'wb'). If the input/output has to be UTF-8, you should explicitly specify encoding='utf-8' when calling open().
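Applied to the JSON case from the question, this could look like the following sketch. The records, the file name, and the temporary directory are made up for illustration; ensure_ascii=False is an optional choice, not something the question's code used.

```python
import json
import os
import tempfile

# Illustrative data: a list of dicts mapping strings to strings.
records = [{"name": "Zoë", "city": "München"}]

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "records.json")

# Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Reading with the same explicit encoding gives back str objects.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Binary mode shows what is really stored: a bytes object.
with open(path, "rb") as f:
    raw = f.read()

print(loaded == records)
print(raw)
```

Note that json.dump uses ensure_ascii=True by default, which escapes every non-ASCII character as \uXXXX. The resulting file is then pure ASCII, which is one reason a round trip like the one in the question works even when the platform's default encoding is not UTF-8. Writing plain strings without JSON has no such safety net, so the explicit encoding matters more there.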