Tags: python, python-3.x, unicode, python-3.4, non-ascii-characters

How to print non-ASCII characters as \uXXXX literals


# what I currently have

print('你好')

# 你好

# this is what I want

print('你好')

# \uXXXX \uXXXX

How do I do this? I want to print all non-ASCII characters in a string as Unicode escape literals.


Solution

  • You can convert a string to a debug representation, with non-ASCII and non-printable characters converted to escape sequences, using the ascii() function:

    As repr(), return a string containing a printable representation of an object, but escape the non-ASCII characters in the string returned by repr() using \x, \u or \U escapes.

    For Unicode codepoints in the range U+0100-U+FFFF this uses \uhhhh escapes; for the Latin-1 range (U+0080-U+00FF), as well as the DEL control character U+007F, \xhh escapes are used instead; codepoints above U+FFFF get \Uhhhhhhhh escapes. Note that the output is valid Python syntax for re-creating the string, so quotes are included:

    >>> print('你好')
    你好
    >>> print(ascii('你好'))
    '\u4f60\u597d'
    >>> print(ascii('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
    'ASCII is not changed, Latin-1 (\xe5\xe9\xee\xf8\xfc) is, as are all higher codepoints, such as \u4f60\u597d'
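
To see which escape form ascii() picks for each range, and to get similar escapes without the surrounding quotes, here is a minimal sketch. The `unicode_escape` codec used at the end is a different technique from ascii(): it also escapes backslashes and control characters such as tabs and newlines, so treat it as an alternative rather than an exact equivalent. The sample characters are chosen only for illustration.

```python
# Escape form chosen by ascii() for one sample character per range.
samples = {
    '\u00e9': ascii('\u00e9'),          # U+00E9, Latin-1 range -> \xhh
    '\u4f60': ascii('\u4f60'),          # U+4F60, up to U+FFFF -> \uhhhh
    '\U0001f600': ascii('\U0001f600'),  # U+1F600, above U+FFFF -> \Uhhhhhhhh
}
for char, escaped in samples.items():
    # Print only ASCII so the output is safe on any terminal encoding.
    print('U+{:04X} -> {}'.format(ord(char), escaped))

# Alternative without quotes: the unicode_escape codec.
# Note: this also escapes backslashes, tabs and newlines.
no_quotes = '\u4f60\u597d'.encode('unicode_escape').decode('ascii')
print(no_quotes)
```

Like ascii(), the codec keeps \xhh for the Latin-1 range, so it does not give \uhhhh for everything.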
    

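    If you must have \uhhhh for everything, including the Latin-1 range, you'll have to do your own conversion (codepoints above U+FFFF do not fit in four hex digits, so they still need \Uhhhhhhhh):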

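    import re
    
    # The pattern matches every non-ASCII codepoint; it is compiled once as a default argument.
    def escape_unicode(t, _p=re.compile(r'[\u0080-\U0010ffff]')):
        def escape(match):
            char = ord(match.group())
            # Up to U+FFFF use \uhhhh; anything higher needs the 8-digit \Uhhhhhhhh form.
            return '\\u{:04x}'.format(char) if char < 0x10000 else '\\U{:08x}'.format(char)
        return _p.sub(escape, t)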
    

    Unlike ascii(), the function above does not add quotes:

    >>> print(escape_unicode('你好'))
    \u4f60\u597d
    >>> print(escape_unicode('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
    ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d
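
As a quick check of the custom function, the sketch below repeats escape_unicode() so that it runs on its own, then shows a codepoint above U+FFFF and a round trip back to the original text. The round trip through the `unicode_escape` codec is an assumption about how you might consume the output, not part of the answer's function; it only holds when the input contains no literal backslashes, since those would be reinterpreted as escapes.

```python
import re

# Copy of the function shown earlier, repeated so this example is self-contained.
def escape_unicode(t, _p=re.compile(r'[\u0080-\U0010ffff]')):
    def escape(match):
        char = ord(match.group())
        return '\\u{:04x}'.format(char) if char < 0x10000 else '\\U{:08x}'.format(char)
    return _p.sub(escape, t)

# Sample text: Latin-1 (U+00E9), two CJK characters, and one emoji (U+1F600).
original = 'caf\u00e9 \u4f60\u597d \U0001f600'
escaped = escape_unicode(original)
print(escaped)

# Decode the escapes again; safe here because the input has no backslashes.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == original)
```

The emoji comes out as an eight-digit \U escape, while everything at or below U+FFFF uses the four-digit \u form.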