
Python unicode string literals in module declared as utf-8


I have a dummy Python module with the utf-8 header that looks like this:

# -*- coding: utf-8 -*-
a = "á"
print type(a), a

Which prints:

<type 'str'> á

But I thought that all string literals inside a Python module declared as utf-8 would automatically be of type unicode, instead of str. Am I missing something, or is this the correct behaviour?

In order to get a as a unicode string I use:

a = u"á"

But this doesn't seem very "polite", nor practical. Is there a better option?


Solution

  • No. The codec declaration at the top only tells Python how to decode the source file, and Python uses that codec when interpreting Unicode literals. It does not turn bytestring literals into unicode values. As PEP 263 states:

    This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in an Unicode aware editor.

    Emphasis mine.
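
As a rough illustration of what the declaration does control, the sketch below compiles the same literal bytes under two different declarations. The helper name `run` and the `<demo>` filename are made up for this example; the bytes `\xc3\xa1` are the UTF-8 encoding of á.

```python
# The same two bytes (0xC3 0xA1) inside a u'...' literal,
# preceded by two different encoding declarations.
literal_line = b"value = u'\xc3\xa1'\n"
utf8_source = b"# -*- coding: utf-8 -*-\n" + literal_line
latin1_source = b"# -*- coding: latin-1 -*-\n" + literal_line


def run(source):
    """Compile and execute module source given as bytes; return its `value`."""
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["value"]


utf8_value = run(utf8_source)      # the two bytes are decoded as one character
latin1_value = run(latin1_source)  # the two bytes are decoded as two characters

print(len(utf8_value))
print(len(latin1_value))
```

Only the interpretation of the unicode literal changes between the two runs; the bytes in the file are identical.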

    Without the codec declaration, Python has no idea how to interpret non-ASCII characters:

    $ cat /tmp/test.py 
    example = '☃'
    $ python2.7 /tmp/test.py 
      File "/tmp/test.py", line 1
    SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
    

    If Python behaved the way you expect it to, you would not be able to write literal bytestring values that contain non-ASCII byte values either.

    If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only because the two encodings happen to match.
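
To make the "luck" concrete, here is a small sketch that decodes the same UTF-8 bytes the way a UTF-8 terminal and a Latin-1 terminal would interpret them. Latin-1 is only an arbitrary stand-in for "some other terminal encoding"; the variable names are illustrative.

```python
# UTF-8 encoded bytes for the single character U+00E1 (a with acute accent).
utf8_bytes = u"\u00e1".encode("utf-8")

# A UTF-8 terminal interprets the two bytes as one character.
seen_by_utf8_terminal = utf8_bytes.decode("utf-8")

# A Latin-1 terminal interprets each byte as its own character (mojibake).
seen_by_latin1_terminal = utf8_bytes.decode("latin-1")

print(len(utf8_bytes))
print(repr(seen_by_utf8_terminal))
print(repr(seen_by_latin1_terminal))
```

The byte string itself never changes; only the decoding applied on the display side does.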

    The correct way to get unicode values is to use unicode literals or to otherwise produce unicode (decoding from byte strings, converting integer codepoints to unicode characters, etc.):

    unicode_snowman = '\xe2\x98\x83'.decode('utf8')
    unicode_snowman = unichr(0x2603)
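
If prefixing every literal with u is the main annoyance, one further option, not covered in the answer above, is `from __future__ import unicode_literals` (available since Python 2.6). It makes unprefixed string literals in that one module unicode. A minimal sketch, assuming the file is saved as UTF-8:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

a = "á"            # a unicode value, even without the u prefix
raw = b"\xc3\xa1"  # the b prefix still produces a byte string

print(type(a).__name__)  # 'unicode' on Python 2, 'str' on Python 3
print(len(a))            # one character
print(len(raw))          # two bytes
```

The import only affects the module it appears in, and some Python 2 APIs still expect byte strings, so it may be worth introducing one module at a time.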
    

    In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.
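
The difference in defaults can be checked with a small sketch that compiles source bytes carrying no encoding declaration at all. The identifier `größe` and the `<demo>` filename are arbitrary choices for this example.

```python
# UTF-8 bytes for the line "größe = 42", with no coding declaration.
source = b"gr\xc3\xb6\xc3\x9fe = 42\n"

namespace = {}
try:
    exec(compile(source, "<demo>", "exec"), namespace)
    outcome = "compiled"
except SyntaxError:
    # Python 2 ends up here: non-ASCII bytes with no declared encoding.
    outcome = "SyntaxError"

print(outcome)
```

On Python 3 the bytes are decoded as UTF-8 by default and `größe` becomes an ordinary variable name; on Python 2 the same bytes are rejected with a SyntaxError like the one shown earlier.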