Tags: python, encoding, utf-8, utf-7

Python 2.7.3 UTF-8 Encoding Irreversible


I've come across a few very troublesome strings while crawling the web. In particular, one page advertises itself as UTF-7, and though it's not quite valid UTF-7, that doesn't appear to be the issue. I'm not concerned with representing the exact intent of the text; I just need to get it into UTF-8 for downstream consumption.

The oddity I'm faced with is that I can end up with a unicode string that cannot be encoded to UTF-8 and then decoded back. I've distilled the string down as much as I can while still exhibiting the error:

byte_values = [43, 105, 100, 41, 46, 101, 95, 39, 43, 105, 100, 43]
string = ''.join(chr(c) for c in byte_values)
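# As plain ASCII, these bytes read "+id).e_'+id+"; each '+' opens a UTF-7
# base64-shifted run, and both runs here are cut off, which is the
# malformation mentioned below.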

# This particular string happens to be advertised as UTF-7, though it is
# a bit malformed. We'll ignore these errors when decoding it.
decoded = string.decode('utf-7', 'ignore')

# This decoded string, however, cannot be round-tripped through UTF-8;
# on the affected system, the decode below raises UnicodeDecodeError:
error = decoded.encode('utf-8').decode('utf-8')

I've tried this successfully on a number of systems: Python 2.7.1 and 2.6.7 on Mac OS X 10.5.7, and Python 2.7.2 and 2.6.8 on CentOS. Unfortunately, it fails on the machines where we need it to work, which run Python 2.7.3 on Ubuntu 12.04. On the failing system, I see:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf7 in position 4: invalid start byte

Here are some of the intermediate values that I see on the working vs. non-working systems:

# Working:
>>> repr(decoded)
'u".e_\'\\u89df"'
>>> repr(decoded.encode('utf-8'))
'".e_\'\\xe8\\xa7\\x9f"'

# Non-working:
>>> repr(decoded)
'u".e_\'\\U089d89df"'
>>> repr(decoded.encode('utf-8'))
'".e_\'\\xf7\\x98\\xa7\\x9f"'

The two diverge right after the UTF-7 decode, though why is still a mystery to me. I imagine it's an issue with missing character tables or an auxiliary library, since nothing that changed between 2.7.2 and 2.7.3 appears to explain this behavior. On the systems where it works correctly, printing the unicode character displays a Chinese character; on the failing system it displays a placeholder.

This brings me to my question: does this issue look familiar to anyone, or does anyone have an idea which supporting libraries I might be missing on the affected system?


Solution

  • The problem here is that the UTF-7 decode is, for some reason, returning you illegal Unicode characters.

    It's basically not documented what happens when you've got a unicode object with illegal characters in it. The C API basically just tells you "don't do that, or things will break". The Python API doesn't mention it because it should be impossible, unless you've done something undefined with the C API, which is already covered.

    Unless, of course, a bug in the built-in codecs causes it to do something undefined on your behalf. Which seems to be what's happening here.

    One plausible reason you're seeing this on some platforms but not others is that the working platforms all use narrow Unicode builds, where this problem can't occur. (You can't represent a code point above 0x10FFFF on a build whose code units are only 2 bytes, except via UTF-16 surrogate pairs, so you'd presumably hit the exception, or the ignore, when the surrogates were encoded.)
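    One quick way to confirm which kind of build you're dealing with is sys.maxunicode, which is 0xFFFF (65535) on a narrow build and 0x10FFFF (1114111) on a wide build:

    import sys

    if sys.maxunicode == 0xFFFF:
        print 'narrow Unicode build (16-bit code units)'
    else:  # sys.maxunicode == 0x10FFFF
        print 'wide (UCS-4) Unicode build'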

    The fact that the illegal character you're getting is \U089d89df, and the character you get on a Mac (where the system Python build is narrow-Unicode) is \u89df, is pretty suggestive of some code taking a shortcut somewhere that assumes narrow Unicode. But to actually track down the bug, you'd need to look through multiple places in the source, and compare how Python is built on each platform (narrow-vs.-wide might not be the only difference), and/or look through bugs and change logs…

    And ultimately, if you want to run on Ubuntu systems with that Python build, unless you want to write your own custom C module, you'll have to work around that bug, right?

    So you're probably just looking for a simple workaround. In that case, this should work:

    decoded = u''.join(c for c in decoded if ord(c) <= 0x10FFFF)
    

    This strips out any characters whose code point is larger than the largest legal Unicode character. It should solve the problem anywhere it exists, and be harmless (except for some wasted CPU time) otherwise.
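    For example, applied to the string from the question, the round trip should then succeed even on the affected build (a quick sketch reusing the names from the question's snippet):

    cleaned = u''.join(c for c in decoded if ord(c) <= 0x10FFFF)
    roundtripped = cleaned.encode('utf-8').decode('utf-8')
    assert roundtripped == cleaned  # no UnicodeDecodeError this time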

    For many applications, you really only need to deal with BMP characters, and anything in the supplementary or private use planes is more likely to be an error than actual data, so it may be simpler or more robust to use 0xFFFF instead of 0x10FFFF.
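    If you want to keep that cutoff configurable, you could wrap the filter in a small helper (the name strip_illegal and its default are just illustrative):

    def strip_illegal(text, max_codepoint=0x10FFFF):
        # Pass max_codepoint=0xFFFF to keep only BMP characters.
        return u''.join(c for c in text if ord(c) <= max_codepoint)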