s = "Hello隼"
s = s.encode("utf-8").decode("unicode_escape").encode("unicode_escape").decode("utf-8")
print(s)
This prints Hello\xe9\x9a\xbc. But why?!
Yes, I know those escape sequences are equivalent to the kanji. But they're not equivalent in the terminal, nor inside a URL or in other cases that actually need the symbol itself.
How do I get the original string back? I thought these operations were supposed to be reversible. Either the utf-8 steps or the unicode_escape steps work on their own, but mixing them breaks the string.
TL;DR - don't decode byte strings that contain non-ASCII bytes with unicode_escape. The result of an .encode('unicode_escape') is always ASCII-only.
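That ASCII-only property is easy to check directly; a quick sketch:

```python
# encode('unicode_escape') turns every non-ASCII character into a
# backslash escape, so the resulting bytes are pure ASCII.
escaped = 'Hello隼'.encode('unicode_escape')
print(escaped)            # b'Hello\\u96bc'
print(escaped.isascii())  # True
```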
To break down the issue:
s = 'Hello隼'
print(s)
s = s.encode('utf-8')
print(s)
s = s.decode('unicode_escape')
print(s)
s = s.encode('unicode_escape')
print(s)
s = s.decode('utf-8')
print(s)
Output (comments added):
Hello隼 # Original Unicode string
b'Hello\xe9\x9a\xbc' # bytes object with non-ASCII bytes *displayed* as escapes
Helloéš¼ # `unicode_escape` decodes non-ASCII *bytes* as latin-1
b'Hello\\xe9\\x9a\\xbc' # Latin-1 code points *literally* written as escapes.
Hello\xe9\x9a\xbc # only ASCII bytes ('\xe9' is now the 4 ASCII characters \, x, e, 9).
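The latin-1 behavior in step 3 can be verified on its own: unicode_escape maps each raw non-ASCII byte to the Unicode code point with the same numeric value, which is exactly the latin-1 mapping.

```python
raw = b'\xe9\x9a\xbc'  # the UTF-8 bytes of '隼'
# decode('unicode_escape') reads raw bytes 0x80-0xFF as the code
# points U+0080-U+00FF, i.e. exactly what latin-1 does.
assert raw.decode('unicode_escape') == raw.decode('latin-1')
assert raw.decode('unicode_escape') == '\xe9\x9a\xbc'
```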
So to actually reverse it, .encode('latin-1') is needed instead, to undo the .decode('unicode_escape') step:
s = 'Hello隼'
print(s)
s = s.encode('utf-8')
print(s)
s = s.decode('unicode_escape')
print(s)
s = s.encode('latin-1')
print(s)
s = s.decode('utf-8')
print(s)
Output:
Hello隼
b'Hello\xe9\x9a\xbc'
Helloéš¼
b'Hello\xe9\x9a\xbc' # Not *literal* escapes this time, but *displayed* escaped bytes.
Hello隼 # Multibyte UTF-8 sequence decoded correctly.
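Why latin-1 works as the inverse: it is a 1:1 mapping between the bytes 0-255 and the code points U+0000-U+00FF, so encoding with it exactly undoes the byte-to-character mapping that decode('unicode_escape') applied. A minimal round-trip check:

```python
# latin-1 maps bytes 0-255 one-to-one onto U+0000-U+00FF, so it
# recovers the original UTF-8 bytes after a unicode_escape decode.
utf8 = 'Hello隼'.encode('utf-8')
mangled = utf8.decode('unicode_escape')    # bytes read as latin-1
assert mangled.encode('latin-1') == utf8   # original bytes recovered
assert mangled.encode('latin-1').decode('utf-8') == 'Hello隼'
```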
Better to skip the UTF-8 steps:
s = 'Hello隼'
print(s)
s = s.encode('unicode_escape')
print(s)
s = s.decode('unicode_escape')
print(s)
Output:
Hello隼
b'Hello\\u96bc' # Correct Unicode code point escape.
Hello隼
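If you're already stuck with a str that contains literal escape text such as Hello\xe9\x9a\xbc (say, read back from a log file), latin-1 is again the safe bridge between str and bytes. A hypothetical helper sketching the recovery (the name unescape_utf8 is made up for illustration; it assumes the input is ASCII text plus escapes, since encode('latin-1') fails on characters above U+00FF):

```python
def unescape_utf8(s: str) -> str:
    """Recover a UTF-8 string from literal \\xNN escape text."""
    raw = s.encode('latin-1')              # str -> identical bytes
    # Interpret the escapes, map the resulting latin-1 characters
    # back to bytes, then decode those bytes as UTF-8.
    return raw.decode('unicode_escape').encode('latin-1').decode('utf-8')

print(unescape_utf8('Hello\\xe9\\x9a\\xbc'))  # Hello隼
```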