
Issue with encode/decode in Python 3 with non-ASCII characters


I am trying to use Python 3's unicode_escape codec to turn the literal \n in my string into a real newline. The problem is that the string also contains non-ASCII characters: if I encode it as utf8 and then decode the bytes with unicode_escape, the special character gets garbled. Is there a way to have the \n replaced with a newline without garbling the special character?

s = "hello\\nworld└--"
print(s.encode('utf8').decode('unicode_escape'))

Expected Result:
hello
world└--

Actual Result:
hello
worldâ--

Solution

  • As user wowcha observes, the unicode-escape codec assumes a latin-1 encoding, but your string contains a character that is not encodable as latin-1.

    >>> s = "hello\\nworld└--"
    >>> s.encode('latin-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
    

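    Encoding the string as utf-8 gets around the encoding problem, but results in mojibake when decoding from unicode-escape: '└' becomes the three bytes E2 94 94, and each byte is decoded as a separate latin-1 character ('â' followed by two invisible control characters).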

    The solution is to use the backslashreplace error handler when encoding. This will convert the problem character to an escape sequence that can be encoded as latin-1 and does not get mangled when decoded from unicode-escape.

    >>> s.encode('latin-1', errors='backslashreplace')
    b'hello\\nworld\\u2514--'
    
    >>> s.encode('latin-1', errors='backslashreplace').decode('unicode-escape')
    'hello\nworld└--'
    
    >>> print(s.encode('latin-1', errors='backslashreplace').decode('unicode-escape'))
    hello
    world└--
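
If you need this in more than one place, the two steps can be wrapped in a small helper. This is only a sketch: the function name `unescape_keep_unicode` is made up for illustration, and it assumes you really do want every backslash escape in the input interpreted (not just \n). It also runs the utf8 route from the question next to it so the difference is visible.

```python
def unescape_keep_unicode(text: str) -> str:
    """Interpret backslash escapes (e.g. a literal \\n) without garbling non-ASCII text.

    Hypothetical helper name; the technique is the one from the answer above.
    """
    # Characters outside latin-1 are turned into \uXXXX / \UXXXXXXXX escapes;
    # characters inside the latin-1 range stay as single bytes.
    escaped = text.encode('latin-1', errors='backslashreplace')
    # unicode-escape reads the raw bytes as latin-1 and expands every escape sequence.
    return escaped.decode('unicode-escape')


s = "hello\\nworld└--"

# The utf8 route from the question: each UTF-8 byte of '└' is decoded
# as its own latin-1 character, which is where the mojibake comes from.
garbled = s.encode('utf8').decode('unicode_escape')

# The backslashreplace route keeps '└' intact.
fixed = unescape_keep_unicode(s)

print(repr(garbled))  # 'hello\nworldâ\x94\x94--'
print(repr(fixed))    # 'hello\nworld└--'
```

Note that any backslash in the input is treated as the start of an escape sequence, so this is not suitable for text where a backslash is meant literally (Windows paths, for example).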