# what I currently have
print('你好')
# 你好
# this is what I want
print('你好')
# \uXXXX \uXXXX
How do I do this? I want to print all non-ASCII characters in strings as Unicode escape literals.
You can convert strings to a debug representation, with non-ASCII and non-printable characters converted to escape sequences, using the ascii() function:
As repr(), return a string containing a printable representation of an object, but escape the non-ASCII characters in the string returned by repr() using \x, \u or \U escapes.
For Unicode codepoints in the range U+0100-U+FFFF this uses \uhhhh escapes; for the Latin-1 range (U+007F-U+00FF) \xhh escapes are used instead, and codepoints above U+FFFF get \Uhhhhhhhh escapes. Note that the output qualifies as valid Python syntax to re-create the string, so quotes are included:
>>> print('你好')
你好
>>> print(ascii('你好'))
'\u4f60\u597d'
>>> print(ascii('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
'ASCII is not changed, Latin-1 (\xe5\xe9\xee\xf8\xfc) is, as are all higher codepoints, such as \u4f60\u597d'
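To make the three escape forms and the "valid Python syntax" point concrete, here is a small sketch. The sample characters and the use of ast.literal_eval() to parse the result back are my own choices for illustration, not something the documentation prescribes:

```python
import ast

# One sample per escape form: Latin-1, Basic Multilingual Plane, and beyond U+FFFF.
samples = ['\xe9', '\u4f60', '\U0001f600']

# ascii() returns the quoted, escaped representation of each string.
escaped = [ascii(s) for s in samples]
for e in escaped:
    print(e)

# Because the result is a valid string literal, parsing it recovers the original.
round_tripped = [ast.literal_eval(e) for e in escaped]
print(round_tripped == samples)
```

This should print the \xe9, \u4f60 and \U0001f600 forms, each wrapped in quotes, followed by True.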
If you must have \uhhhh for everything, you'll have to do your own conversion:
import re

def escape_unicode(t, _p=re.compile(r'[\u0080-\U0010ffff]')):
    # Replace each non-ASCII character with a \uhhhh escape
    # (or \Uhhhhhhhh for codepoints above U+FFFF).
    def escape(match):
        char = ord(match.group())
        return '\\u{:04x}'.format(char) if char < 0x10000 else '\\U{:08x}'.format(char)
    return _p.sub(escape, t)
The above function does not add quotes like the ascii() function does:
>>> print(escape_unicode('你好'))
\u4f60\u597d
>>> print(escape_unicode('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d
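The same conversion can also be written without a regular expression. The following is only a sketch of that idea; escape_non_ascii is a hypothetical name of mine, and it also shows the \Uhhhhhhhh branch for a codepoint above U+FFFF:

```python
def escape_non_ascii(text):
    """Return text with every non-ASCII character replaced by an escape.

    Hypothetical regex-free variant: \\uhhhh for codepoints up to U+FFFF,
    \\Uhhhhhhhh for anything above that. ASCII characters pass through.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)                       # plain ASCII, unchanged
        elif cp < 0x10000:
            parts.append('\\u{:04x}'.format(cp))   # 4-digit escape
        else:
            parts.append('\\U{:08x}'.format(cp))   # 8-digit escape
    return ''.join(parts)


print(escape_non_ascii('\xe5\xe9 \u4f60\u597d \U0001f600'))
# \u00e5\u00e9 \u4f60\u597d \U0001f600
```

Like the regex version, this leaves backslashes already present in the input untouched, so the output cannot always be decoded back unambiguously.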