python python-3.x unicode python-3.4 non-ascii-characters

How to print non-ascii characters as \uXXXX literals

# what I currently have

print('你好')

# 你好

# this is what I want

print('你好')

# \uXXXX \uXXXX

How do I do this? I want to print all non-ASCII characters in strings as Unicode escape literals.

Solution

You can use the ascii() function to get a debug representation of a string in which non-ASCII and non-printable characters are replaced by escape sequences. From the documentation:

As repr(), return a string containing a printable representation of an object, but escape the non-ASCII characters in the string returned by repr() using \x, \u or \U escapes.

For Unicode codepoints in the range U+0100-U+FFFF this uses \uhhhh escapes; for the range U+007F-U+00FF (which covers the non-ASCII part of Latin-1) \xhh escapes are used instead, and codepoints above U+FFFF get \Uhhhhhhhh escapes. Note that the output is valid Python syntax for re-creating the string, so quotes are included:

>>> print('你好')
你好
>>> print(ascii('你好'))
'\u4f60\u597d'
>>> print(ascii('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
'ASCII is not changed, Latin-1 (\xe5\xe9\xee\xf8\xfc) is, as are all higher codepoints, such as \u4f60\u597d'
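
If the surrounding quotes are unwanted, one possible alternative is the unicode_escape codec, which uses the same \xhh / \uhhhh / \Uhhhhhhhh forms as ascii(). This is only a sketch: the helper name escape_non_ascii is made up for illustration, and note that the codec also escapes backslashes, newlines and tabs, which may or may not be what you want.

```python
def escape_non_ascii(text):
    # Hypothetical helper name, not part of the standard library.
    # Encoding with 'unicode_escape' yields ASCII-only bytes in which
    # non-ASCII characters are written as \xhh, \uhhhh or \Uhhhhhhhh;
    # decoding those bytes as ASCII turns them back into a str.
    # Caveat: backslashes, newlines and tabs are escaped as well.
    return text.encode('unicode_escape').decode('ascii')

print(escape_non_ascii('你好'))   # \u4f60\u597d
print(escape_non_ascii('åé'))     # \xe5\xe9
```

As with ascii(), Latin-1 characters still come out as \xhh here rather than \u00hh.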

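If you must have \uhhhh escapes for the Latin-1 range as well, you'll have to do your own conversion. The function below uses \uhhhh for everything up to U+FFFF and \Uhhhhhhhh for codepoints beyond that, since those don't fit in four hex digits: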

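import re

def escape_unicode(t, _p=re.compile(r'[\u0080-\U0010ffff]')):
    # _p is compiled once, at definition time; it matches any non-ASCII character
    def escape(match):
        char = ord(match.group())
        # 4 hex digits up to U+FFFF, 8 hex digits for codepoints above that
        return '\\u{:04x}'.format(char) if char < 0x10000 else '\\U{:08x}'.format(char)
    return _p.sub(escape, t)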

The above function does not add quotes like the ascii() function does:

>>> print(escape_unicode('你好'))
\u4f60\u597d
>>> print(escape_unicode('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d
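
If the target is JSON-style output rather than Python syntax, a different route that may be worth considering is json.dumps(), which by default (ensure_ascii=True) writes every non-ASCII character as \uhhhh. The sketch below assumes that is acceptable; the helper name escape_json_style is hypothetical, and slicing off the quotes is a shortcut for illustration only.

```python
import json

def escape_json_style(text):
    # Hypothetical helper name. json.dumps() with the default
    # ensure_ascii=True writes each non-ASCII character as \uhhhh;
    # characters above U+FFFF become a UTF-16 surrogate pair.
    # The result is wrapped in double quotes, which are stripped here.
    # Caveat: embedded double quotes, backslashes and control
    # characters are escaped too.
    return json.dumps(text)[1:-1]

print(escape_json_style('åé 你好'))      # \u00e5\u00e9 \u4f60\u597d
print(escape_json_style('\U0001f600'))   # \ud83d\ude00
```

Unlike escape_unicode() above, characters beyond U+FFFF come out as two \uhhhh surrogate escapes. That is what JSON expects, but Python would read such a literal as two lone surrogates rather than the original character.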