When running the code:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import xml.etree.ElementTree as ET
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text
produces the expected output vägen. However, if I pipe the output to wc -l, I get a UnicodeEncodeError (ETerr.py holds the code snippet given above):
:~> ETerr.py | wc -l
Traceback (most recent call last):
File "./ETerr.py", line 5, in <module>
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
0
:~>
How can the code behave differently depending on whether its output is piped, and how can I fix it so that it doesn't?
Please note that the code snippet above is merely set up to demonstrate the issue with as little code as possible; in the actual script where I need to resolve the issue, the XML is retrieved using urllib, so I have little control over its format.
First, let me point out that this is not a problem in Python 3, and fixing it is in fact one of the reasons that it was worth a compatibility-breaking change to the language in the first place. But I'll assume you have a good reason for using Python 2, and can't just upgrade.
The proximate cause here (assuming you're using Python 2.7 on a POSIX platform; things can be more complicated on older 2.x, and on Windows) is the value of sys.stdout.encoding. When you start up the interpreter, it does the equivalent of this pseudocode:
if isatty(stdoutfd):
    sys.stdout.encoding = parse_locale(os.environ['LC_CTYPE'])
else:
    sys.stdout.encoding = None
And every time you write to a file, including sys.stdout, and including implicitly from a print statement, it does something like this:
if isinstance(s, unicode):
    if self.encoding:
        s = s.encode(self.encoding)
    else:
        s = s.encode(sys.getdefaultencoding())
The actual code does standard POSIX stuff, looking for fallbacks like LANG, and hardcodes a fallback to UTF-8 in some cases for Mac OS X, etc., but this is close enough.
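You can reproduce the effect of that fallback by hand; this sketch runs the same two encode calls the implicit machinery would, for the string from the question:

```python
# -*- coding: UTF-8 -*-
# Sketch of the implicit encode step. When sys.stdout.encoding is None,
# Python 2 falls back to sys.getdefaultencoding(), which is 'ascii',
# and the 'ascii' codec cannot represent u'\xe4' (the 'ä' in 'vägen').
s = u'v\xe4gen'  # u'vägen'

print(s.encode('UTF-8'))  # what a UTF-8 terminal receives

try:
    s.encode('ascii')     # what the fallback tries when output is piped
except UnicodeEncodeError as err:
    print(err)
```

The second encode raises exactly the UnicodeEncodeError shown in the traceback above.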
This is only sparsely documented, under file.encoding:
The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.
To verify that this is your problem, try the following:
$ python -c 'print __import__("sys").stdout.encoding'
UTF-8
$ python -c 'print __import__("sys").stdout.encoding' | cat
None
To be extra sure this is the problem:
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding'
Latin-1
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding' | cat
Latin-1
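The same check can be scripted from inside Python; this is just a sketch using only the standard library, with sys.executable so it probes the same interpreter you're running:

```python
# Probe a child interpreter whose stdout is a pipe (subprocess always
# gives the child a pipe when capturing output). With PYTHONIOENCODING
# set, the override wins even though stdout is not a terminal.
import codecs
import os
import subprocess
import sys

probe = 'import sys; sys.stdout.write(str(sys.stdout.encoding))'

env = dict(os.environ, PYTHONIOENCODING='Latin-1')
out = subprocess.check_output([sys.executable, '-c', probe], env=env)

# Whatever spelling the child reports, it resolves to the Latin-1 codec.
print(codecs.lookup(out.decode('ascii')).name)
```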
So, how do you fix this?
Well, the obvious way is to upgrade to Python 3.6, where you'll get UTF-8 in both cases, but I'll assume there's a reason you're using Python 2.7 and can't easily change it.
The right solution is actually pretty complicated. But if you want a quick & dirty solution that works for your system, and for most current Linux and Mac systems with standard Python 2.7 setups (even though it may be disastrously wrong for older Linux systems, older Python 2.x versions, and weird setups), you can either:
- Set PYTHONIOENCODING to override the detection and force UTF-8. Setting this in your profile or similar may be worth doing if you know that every terminal and every tool you're ever going to use from this account is UTF-8, although it's a terrible idea if that isn't true.
- Check sys.stdout.encoding and wrap it with a 'UTF-8' encoding if it's None.
- Explicitly .encode('UTF-8') on everything you print.
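The second option can be sketched like this (it runs unchanged under Python 3 too, where the guard is simply never taken because sys.stdout.encoding is never None there):

```python
# Quick & dirty fix: if stdout has no encoding (i.e. it is a pipe under
# Python 2), wrap it in a writer that encodes unicode text as UTF-8.
import codecs
import sys

if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)

print(u'v\xe4gen')  # safe whether stdout is a terminal or a pipe
```

Put this once at the top of the script, before any print that might emit non-ASCII text.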