python xml encoding pipe wc

UnicodeEncodeError if piping output to wc -l


When running the code:

#! /usr/bin/env python
# -*- coding: UTF-8 -*-

import xml.etree.ElementTree as ET
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text 

This produces the expected output vägen. However, if I pipe it to wc -l I get a UnicodeEncodeError, e.g. (ETerr.py holds the code snippet given above):

:~> ETerr.py | wc -l
Traceback (most recent call last):
  File "./ETerr.py", line 5, in <module>
    print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
0
:~> 

How can the code behave differently depending on whether its output is piped, and how can I fix it so that it doesn't?

Please note that the code snippet above is merely set up to demonstrate the issue with as little code as possible. In the actual script where I need to resolve the issue, the XML is retrieved using urllib, so I have little control over its format.


Solution

  • First, let me point out that this is not a problem in Python 3, and fixing it is in fact one of the reasons that it was worth a compatibility-breaking change to the language in the first place. But I'll assume you have a good reason for using Python 2, and can't just upgrade.


    The proximate cause here (assuming you're using Python 2.7 on a POSIX platform—things can be more complicated on older 2.x, and on Windows) is the value of sys.stdout.encoding. When you start up the interpreter, it does the equivalent of this pseudocode:

    if isatty(stdoutfd):
        sys.stdout.encoding = parse_locale(os.environ.get('LC_CTYPE'))
    else:
        sys.stdout.encoding = None
    

    And every time you write to a file, including sys.stdout, including implicitly from a print statement, it does something like this:

    if isinstance(s, unicode):
        if self.encoding:
            s = s.encode(self.encoding)
        else:
            s = s.encode(sys.getdefaultencoding())
    

    The actual code does standard POSIX stuff looking for fallbacks like LANG, and hardcodes a fallback to UTF-8 in some cases for Mac OS X, etc., but this is close enough.
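
    To make the write-time step concrete, here is a minimal sketch that mimics it outside the interpreter. The function name encode_for_write is made up for illustration, and the default of 'ascii' is an assumption that matches what sys.getdefaultencoding() returns on a stock Python 2.7; it is not the real file implementation.

```python
# -*- coding: utf-8 -*-

def encode_for_write(text, stream_encoding, default_encoding="ascii"):
    """Mimic the write-time pseudocode for a unicode string.

    stream_encoding plays the role of file.encoding: a codec name for a
    terminal, None for a pipe. default_encoding stands in for
    sys.getdefaultencoding(), assumed here to be 'ascii' as on Python 2.7.
    """
    return text.encode(stream_encoding or default_encoding)


road = u"v\xe4gen"  # the text from the question, with the a-umlaut escaped

# Terminal case: the stream has an encoding, so the text encodes cleanly.
terminal_bytes = encode_for_write(road, "UTF-8")

# Piped case: the encoding is None, so the ASCII default is used and fails.
try:
    encode_for_write(road, None)
    piped_error = None
except UnicodeEncodeError as exc:
    piped_error = exc
```

    With a stream encoding of None, the a-umlaut at position 1 is what the ASCII codec rejects, which is the same failure as in the traceback from the question.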


    This is only sparsely documented, under file.encoding:

    The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.


    To verify that this is your problem, try the following:

    $ python -c 'print __import__("sys").stdout.encoding'
    UTF-8
    $ python -c 'print __import__("sys").stdout.encoding' | cat
    None
    

    To be extra sure this is the problem:

    $ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding'
    Latin-1
    $ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding' | cat
    Latin-1
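
    If you would rather run the same check from a script than from the shell, something like the following sketch should work. The helper name child_stdout_encoding is hypothetical. It starts a child interpreter whose stdout is a pipe, which is the same situation as the "| cat" runs above, and returns whatever the child reports as sys.stdout.encoding.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Run a child interpreter with stdout piped and return the text of
    its sys.stdout.encoding (the string 'None' if it has no encoding)."""
    env = dict(os.environ)
    # Start from a clean slate so an inherited setting does not interfere.
    env.pop("PYTHONIOENCODING", None)
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    code = "import sys; sys.stdout.write(str(sys.stdout.encoding))"
    # check_output connects the child's stdout to a pipe, not a terminal.
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii")
```

    Calling child_stdout_encoding() on Python 2.7 should report None, matching the piped run above, while child_stdout_encoding("Latin-1") should report the overridden encoding. The exact spelling of the name may vary between Python versions.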
    

    So, how do you fix this?

    Well, the obvious way is to upgrade to Python 3.6, where stdout gets the same locale-derived encoding (UTF-8 on a typical modern Linux or Mac setup) whether or not it is piped, but I'll assume there's a reason you're using Python 2.7 and can't easily change it.

    The right solution is actually pretty complicated. But if you want a quick-and-dirty solution that works for your system, and for most current Linux and Mac systems with standard Python 2.7 setups (even though it may be disastrously wrong for older Linux systems, older Python 2.x versions, and weird setups), you can do any of the following:

    • Set the environment variable PYTHONIOENCODING to override the detection and force UTF-8. Setting this in your profile or similar may be worth doing if you know that every terminal and every tool you're ever going to use from this account is UTF-8, although it's a terrible idea if that isn't true.
    • Check sys.stdout.encoding and wrap it with a 'UTF-8' encoding if it's None.
    • Explicitly .encode('UTF-8') on everything you print.
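
    The second option can be sketched roughly like this. The helper name wrap_if_unencoded is made up for illustration, and the sketch assumes the stream accepts bytes, as sys.stdout does in Python 2. It leans on codecs.getwriter from the standard library and hardcodes UTF-8, with all the caveats above about systems where that is the wrong guess.

```python
import codecs


def wrap_if_unencoded(stream, encoding="UTF-8"):
    """Return stream wrapped in an encoding writer if it has no encoding.

    A stream whose .encoding is missing or None (a piped sys.stdout on
    Python 2.7) gets wrapped so that unicode strings written to it are
    encoded with the given codec. Any other stream is returned unchanged.
    """
    if getattr(stream, "encoding", None) is None:
        return codecs.getwriter(encoding)(stream)
    return stream


# Intended use near the top of a Python 2 script (left as a comment so
# that importing this sketch has no side effects):
#     import sys
#     sys.stdout = wrap_if_unencoded(sys.stdout)
```

    With that in place, the print in the question should write UTF-8 bytes when piped and behave as before on a terminal. The third option is the same idea applied by hand at each call site in Python 2, i.e. printing text.encode('UTF-8') instead of text.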