Display raw JSON with non-ASCII characters


I'm having trouble displaying raw JSON data in the terminal using Python 3. I get the JSON as a response from urllib:

r = urlopen(request)
response = r.read()

The result is a byte string b"...", part of which contains escaped non-ASCII characters such as b"Chybn\\u00e9 heslo", which should display as "Chybné heslo".

But I don't know how to decode it so that it displays "Chybné heslo". If I do:

print(b"Chybn\\u00e9 heslo".decode('utf-8'))

I just get "Chybn\u00e9 heslo". What am I doing wrong here?


Solution

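  • Use the unicode-escape codec:

    # The backslash is doubled so the literal holds the six bytes \u00e9
    # without relying on an unrecognized escape sequence.
    byte_str = b"Chybn\\u00e9 heslo"
    print(byte_str.decode('unicode-escape')) # Chybné heslo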
    

    The cause of your problem is that in a bytes literal, \u00e9 is not an escape for a Unicode code point.
    It is just a sequence of six bytes:

    >>> len(b'\u00e9') # whereas len('\u00e9') == 1
    6 
    
    >>> [b for b in b'\u00e9']
    [92, 117, 48, 48, 101, 57]
    

    These bytes are all ASCII, and therefore also valid UTF-8, so decoding them as UTF-8 simply gives back the same six characters:

    >>> b'\u00e9'.decode('utf-8')
    '\\u00e9'
    
    >>> [chr(b) for b in b'\u00e9'] # decoding in 'byte-by-byte' mode
    ['\\', 'u', '0', '0', 'e', '9']
    

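    Also note that \\ and \ give the same result when the backslash is followed by a character that does not form a recognized escape sequence; Python then keeps the backslash as-is (see the section on string and bytes literals in the Python language reference). Recent Python versions emit a warning for such unrecognized escapes (a DeprecationWarning since 3.6, a SyntaxWarning since 3.12), so the doubled form is preferable.
    For example: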

    >>> b'\\u' == b'\u'
    True
    >>> b'\\u00e9' == b'\u00e9'
    True
    >>> b'\\n' == b'\n'
    False
    

    >>> '\\u00e9' == '\u00e9'
    False
    
    >>> '\\z' == '\z' 
    True
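
Since the response in the question is JSON, another option worth considering is to let the json module resolve the \uXXXX escapes instead of the unicode-escape codec. The following is only a sketch: the payload and its "error" key are made up to stand in for whatever r.read() actually returns, and it assumes the response is UTF-8 encoded. It also shows why unicode-escape can be risky on bytes that already contain real UTF-8 characters.

```python
import json

# Hypothetical stand-in for the bytes returned by r.read() in the question.
raw = b'{"error": "Chybn\\u00e9 heslo"}'

# Decode the bytes to text first (assuming UTF-8), then parse the JSON.
# The JSON parser turns the \u00e9 escape into the character é.
data = json.loads(raw.decode('utf-8'))
print(data["error"])  # Chybné heslo

# Re-serialize for display; ensure_ascii=False keeps é instead of \u00e9.
pretty = json.dumps(data, ensure_ascii=False, indent=2)
print(pretty)

# Caveat: unicode-escape treats non-escape bytes as Latin-1, so bytes that
# already hold UTF-8 encoded characters come out mangled.
mangled = "é".encode("utf-8").decode("unicode-escape")
print(mangled)  # Ã©
```

If the goal is only to pretty-print whatever the server sent, the json.loads / json.dumps round trip above avoids having to reason about backslashes at all.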