python string unicode utf-8

Removing literal backslashes from UTF-8 encoded strings in Python


I have a bunch of strings containing UTF-8 encoded symbols, for example '\\u00f0\\u009f\\u0098\\u0086'. In this case, the escapes are the UTF-8 bytes of the emoji 😆. I want to replace them with the literal emoji. The solution someone recommended to me was to encode the string as latin-1 and then decode it as utf-8. So,

'\u00f0\u009f\u0098\u0086'.encode('latin-1').decode('utf-8')

gives me the output

'😆'

Unfortunately, all the strings with those codes have literal backslashes in them, so whenever I try the same operations,

'\\u00f0\\u009f\\u0098\\u0086'.encode('latin-1').decode('utf-8')

I get the following result,

'\\u00f0\\u009f\\u0098\\u0086'

Is there a way to remove those backslashes? Because if I replace them with an empty string, all backslashes disappear.


Solution

  • I don't know where you're getting that string from, but it's an unusual way of representing the code point. U+1F606 SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES is encoded in UTF-8 as the bytes F0 9F 98 86. In Python string escapes, \uXXXX represents an entire code point in the Basic Multilingual Plane, and \UXXXXXXXX represents code points beyond it (like this one); neither stands for a single byte of the UTF-8 encoding. So you'd expect to see it represented in a string as '\U0001F606'.

    Anyway, the following extracts the last two hex digits of each escape sequence, joins them into a bytes object, and then decodes the resulting UTF-8 data into a string:

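    import re
    # Text with literal backslashes; each \u00XX escape holds one UTF-8 byte.
    s = '\\u00f0\\u009f\\u0098\\u0086'
    # Capture the last two hex digits of every escape and join them into bytes.
    raw = b''.join(bytes.fromhex(m.group(1)) for m in re.finditer(r'\\u[0-9a-fA-F]{2}([0-9a-fA-F]{2})', s))
    print(raw.decode())  # decode() defaults to UTF-8
    # Displays 😆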
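If the escapes are embedded in longer text, the same idea can be wrapped in a helper. The following is only a sketch under a few assumptions: the names `fix_escaped_utf8` and `fix_via_unicode_escape` are made up for illustration, every escape is assumed to have the form \u00XX (one byte each), and for the second variant the input is assumed to be pure ASCII. The second function chains the `unicode_escape` codec with the latin-1/utf-8 round trip from the question.

```python
import re

# One or more consecutive \u00XX escapes, each written with a literal backslash.
_ESCAPED_BYTES = re.compile(r'(?:\\u00[0-9a-fA-F]{2})+')


def fix_escaped_utf8(text):
    """Replace runs of literal \\u00XX escapes with the text they encode."""
    def _decode(match):
        # Collect the last two hex digits of each escape in the run.
        hex_digits = ''.join(re.findall(r'\\u00([0-9a-fA-F]{2})', match.group(0)))
        try:
            return bytes.fromhex(hex_digits).decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the run exactly as it was.
            return match.group(0)
    return _ESCAPED_BYTES.sub(_decode, text)


def fix_via_unicode_escape(text):
    """Alternative for ASCII-only input: interpret the escapes, then redo
    the latin-1 -> utf-8 round trip from the question."""
    unescaped = text.encode('ascii').decode('unicode_escape')
    return unescaped.encode('latin-1').decode('utf-8')


sample = 'lol \\u00f0\\u009f\\u0098\\u0086 ok'
print(fix_escaped_utf8(sample))
print(fix_via_unicode_escape(sample))
```

The regex version only touches the escape runs, so other backslashes and non-ASCII characters in the text are left alone. The codec version is shorter, but it interprets every backslash escape in the string and raises an error if the text contains non-ASCII characters or anything outside the latin-1 range after unescaping.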