
python3 unicode fails when lowercasing


I'm facing a very strange issue with Python 3:

>>> a = "abcé"
>>> a
'abcé'
>>> print(a)
abcé
>>> print(a.lower())
abc�

I have no idea where this comes from, but it fails to lowercase Unicode characters. Note that I can't reproduce the bug everywhere; I only get it on one of my computers. Also, python2 on that same computer properly prints abcé.

Also, a.upper() returns ABCé instead of ABCÉ, so it does not suffer from the same issue as lower().

Any ideas?


Solution

  • The behaviour you observed is indeed very peculiar: the letter "é" being unaffected by Python's str.upper(), and turned into a replacement character by str.lower(). It is also strange that this appears to depend on the environment, since the said str methods don't exhibit any localisation (even though this would arguably make sense sometimes, as in the case of the Turkish mapping of "i"→"İ" and "ı"→"I"), but always use Unicode's default case algorithm.
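
To illustrate the point about locale-independent case mapping, here is a minimal sketch. The variable names are made up for this example, and the non-ASCII letters are written as escapes so that the snippet does not depend on the terminal encoding.

```python
# str.upper() / str.lower() use Unicode's default case mapping,
# regardless of the locale the interpreter was started with.
dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# A Turkish-aware mapping would turn "i" into U+0130 (dotted capital I);
# Python's default mapping gives a plain "I" for both letters.
upper_of_i = "i".upper()
upper_of_dotless = dotless_i.upper()

# "é" (U+00E9) and "É" (U+00C9) map to each other as expected.
e_acute_upper = "\u00e9".upper()
e_acute_lower = "\u00c9".lower()

# ascii() keeps the output printable on any terminal.
print(ascii(upper_of_i), ascii(upper_of_dotless),
      ascii(e_acute_upper), ascii(e_acute_lower))
```

So a correctly decoded "é" is upper-cased to "É" without trouble; the question's output suggests that Python was not actually looking at "é".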

    Possible explanation

    The most likely explanation for this weird phenomenon is that Python doesn't "see" the same data as you. As Hansaplast wrote in their answer, there's probably an encoding mismatch between the terminal and the Python interpreter. One usually doesn't have to care about this, but when you use the interactive interpreter, the job of displaying typed and printed characters isn't actually performed by Python, but by the terminal [emulator], and this additional layer can be a source of problems sometimes.

    So what exactly is going on? I believe the following scenario can explain the observed behaviour:

    • Your terminal is configured to use UTF-8. When you type "é", it will send the bytes C3 A9 to Python. When it receives C3 A9 from Python, it will display "é".
    • Python, however, uses Latin-1, as you confirmed through the return value of locale.getlocale(). When it receives C3 A9, it decodes this to "Ã©", which is a common case of mojibake.
    • UTF-8 and Latin-1 are both supersets of ASCII, so as long as you only use ASCII characters, this misconfiguration is not an issue. When you type "A", Python reads "A", and the same for output.

    The really nasty thing about this misconfiguration is that it is only visible in certain circumstances. Even non-ASCII characters might pass through unnoticed because of the symmetry of en-/decoding. If Python simply echoes its input, i.e. prints "Ã©", this will be de-mojibaked by the terminal into "é", so the mistake is hidden. But when the individual characters are interpreted in some way, as with str.upper() and str.lower(), unexpected things might occur.

    In your case, .upper() has no effect, because "Ã" is upper-case already and "©" is caseless. That is why 'abcé'.upper() results in 'ABCé' on the screen. But lowercasing produces "ã©", which Python encodes as E3 A9. Since this is not a valid UTF-8 byte sequence, the terminal fails at interpreting it and shows a replacement character (�) instead.
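
The scenario can be reproduced without touching any terminal settings by doing the encoding and decoding steps by hand. This is only a sketch of the suspected mechanism, not a reproduction of your machine's configuration; roundtrip is a hypothetical helper written for this example.

```python
def roundtrip(text, transform):
    """Simulate a UTF-8 terminal talking to a Python that assumes Latin-1."""
    typed_bytes = text.encode("utf-8")              # bytes the terminal sends
    seen_by_python = typed_bytes.decode("latin-1")  # what Python believes was typed
    result = transform(seen_by_python)              # e.g. str.upper or str.lower
    sent_back = result.encode("latin-1")            # bytes Python writes out
    # The terminal decodes the output as UTF-8; invalid bytes become U+FFFD.
    return sent_back.decode("utf-8", errors="replace")


word = "abc\u00e9"  # "abcé", written with an escape to stay encoding-neutral

echoed = roundtrip(word, lambda s: s)  # plain echo: the mistake stays hidden
uppered = roundtrip(word, str.upper)   # the mojibake pair is unchanged by upper()
lowered = roundtrip(word, str.lower)   # C3 becomes E3, which breaks the UTF-8 sequence

print(ascii(echoed), ascii(uppered), ascii(lowered))
```

The echoed and upper-cased strings come back with an intact "é", while the lower-cased one contains the replacement character, which matches what you see on the affected computer.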

    Solutions

    If this explanation is true, how do you fix the encoding misconfiguration?

    • For interactive sessions, it probably makes sense to let Python take the encoding of STDIN/STDOUT from environment variables such as LC_ALL. Put a line like export LC_ALL=en_US.utf8 in a start-up script for the shell that runs in your terminal, e.g. .bashrc. Changing the locale from within Python has no effect, because the encoding of the standard streams is set on start-up and won't be updated when you call locale.setlocale().

    • For scripts, you might not want to rely on environment variables. You can create a new io.TextIOWrapper around the binary stream underlying each standard channel:

      import sys

      # Re-wrap each standard stream's file descriptor in a UTF-8 text layer.
      sys.stdin = open(sys.stdin.buffer.fileno(), encoding='utf8')
      sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
      sys.stderr = open(sys.stderr.buffer.fileno(), 'w', encoding='utf8')
      

      (I don't recommend this solution for interactive sessions. Especially if you mistype something, you can end up in a situation that is difficult to recover from.)
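
Before and after applying either fix, it may help to check which encodings Python actually reports. The following is a small diagnostic sketch; stream_encodings is a hypothetical helper, and the values it prints depend entirely on your environment.

```python
import locale
import sys


def stream_encodings():
    """Return the encoding reported by each standard stream (None if unknown)."""
    streams = {"stdin": sys.stdin, "stdout": sys.stdout, "stderr": sys.stderr}
    return {name: getattr(stream, "encoding", None)
            for name, stream in streams.items()}


report = stream_encodings()
# The locale's preferred encoding, queried without changing the locale.
report["locale"] = locale.getpreferredencoding(False)

for name, encoding in report.items():
    print(f"{name}: {encoding}")
```

If the terminal is set to UTF-8 but these lines show a Latin-1 variant such as ISO-8859-1, that would be consistent with the mismatch described above.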