I have a bunch of strings containing UTF-8 encoded symbols, for example '\\u00f0\\u009f\\u0098\\u0086'. In this case it represents the emoji 😆, encoded in UTF-8. I want to replace it with the literal emoji. The solution someone recommended to me was to encode it as latin-1 and then decode it as utf-8. So,
'\u00f0\u009f\u0098\u0086'.encode('latin-1').decode('utf-8')
gives me the output
'😆'
Unfortunately, all the strings with those codes have a literal backslash in them, so whenever I try the same operations,
'\\u00f0\\u009f\\u0098\\u0086'.encode('latin-1').decode('utf-8')
I get the following result,
'\\u00f0\\u009f\\u0098\\u0086'
Is there a way to remove those backslashes? Because if I replace them with an empty string, all backslashes disappear.
I don't know where you're getting that string from, but it's an... unusual... way of representing the codepoint. U+1F606 SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES is encoded in UTF-8 as the bytes F0 9F 98 86. In Python string escapes, \uXXXX represents an entire codepoint in the Basic Multilingual Plane, and \UXXXXXXXX represents codepoints beyond it (like this one); neither stands for a single byte of a UTF-8 encoding. So you'd expect to see it represented in a string as '\U0001F606'.
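To make the difference concrete, here is a small sketch comparing the three forms discussed so far. The variable names are just for illustration, and `bytes.hex` with a separator argument assumes Python 3.8 or later.

```python
# One escape for the whole codepoint (outside the BMP, so \U with 8 hex digits).
emoji = '\U0001F606'
utf8_bytes = emoji.encode('utf-8')
print(utf8_bytes.hex(' '))  # f0 9f 98 86

# Four \u escapes: four separate codepoints U+00F0, U+009F, U+0098, U+0086,
# which only happen to share their values with the UTF-8 bytes above.
mojibake = '\u00f0\u009f\u0098\u0086'
print(len(emoji), len(mojibake))  # 1 4

# With doubled backslashes nothing is an escape at all: the string is
# 24 ordinary ASCII characters (backslash, 'u', and four hex digits, 4 times).
escaped_text = '\\u00f0\\u009f\\u0098\\u0086'
print(len(escaped_text))  # 24
```

That is why the latin-1 round trip works on the second form (every character fits in one byte) but leaves the third form unchanged: it is plain ASCII text that merely looks like escapes.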
Anyways, the following will extract the last two hex digits of each escape sequence, turn them into a byte array, and then decode the resulting UTF-8 data into a string:
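import re

s = '\\u00f0\\u009f\\u0098\\u0086'
# Keep the low two hex digits of each \uXXXX sequence as one byte, then decode as UTF-8.
matches = re.finditer(r'\\u[0-9a-fA-F]{2}([0-9a-fA-F]{2})', s)
print(b''.join(bytes.fromhex(m.group(1)) for m in matches).decode())
# Displays 😆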
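As an alternative to the regex, a possible sketch using the standard `unicode_escape` codec is shown below. The helper name `fix_escaped_utf8` is made up for this example. It assumes the input is pure ASCII and that every escape in it stands for a single byte (value 00 to FF); note that the codec will also interpret other backslash sequences such as `\n`, which may or may not be what you want.

```python
def fix_escaped_utf8(text: str) -> str:
    """Turn literal '\\uXXXX' byte escapes into the UTF-8 text they encode.

    Hypothetical helper: assumes ASCII input whose escapes each hold one byte.
    """
    # Step 1: interpret the literal backslash sequences as real escapes.
    unescaped = text.encode('ascii').decode('unicode_escape')
    # Step 2: latin-1 maps U+0000..U+00FF one-to-one onto bytes 00..FF.
    raw = unescaped.encode('latin-1')
    # Step 3: the bytes are the real UTF-8 data.
    return raw.decode('utf-8')


print(fix_escaped_utf8('\\u00f0\\u009f\\u0098\\u0086'))
print(fix_escaped_utf8('so funny \\u00f0\\u009f\\u0098\\u0086'))
```

Unlike the regex version, this keeps the surrounding plain text, but it raises an error if the string contains non-ASCII characters or an escape above U+00FF.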