Tags: python-3.x, unicode, utf-8, unicode-escapes

Unescape double backslash sequences in Python 3


I have a string like this:

x = 'hello this is nice\\r\\n\\xc2\\xa0 goodbye'

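I need to convert this into plain UTF-8 text, with the escape sequences turned into the characters they stand for.

The codecs module on its own did not solve this. The \xc2\xa0 pair comes back as two separate characters instead of one: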

codecs.unicode_escape_decode(x)[0]
'hello this is nice\r\nÂ\xa0 goodbye'

How do I turn that string into clean UTF-8 text?


Solution

  • Not particularly elegant, but this seems to do what you are asking.

    >>> codecs.unicode_escape_decode(x)[0].encode('latin-1').decode('utf-8')
    'hello this is nice\r\n\xa0 goodbye'
    

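    Slightly obscurely, the Latin-1 encoding has the attractive property that each byte value 0x00-0xFF corresponds to the code point with the same number. Encoding a string to Latin-1 therefore gives back the raw bytes unchanged (provided every character is below U+0100), and decoding does the reverse, so it can be used to convert transparently between bytes and str.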

    (In case it's not obvious, b'\xc2\xa0' is the UTF-8 encoding of U+00A0.)
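
The same three steps can be wrapped in a small helper. This is only a sketch of the approach above: the function name `unescape_utf8` is made up for illustration, and it assumes every escape in the input stands for a byte of UTF-8 data. A `\uXXXX` escape above U+00FF would make the Latin-1 step raise `UnicodeEncodeError`.

```python
import codecs


def unescape_utf8(text: str) -> str:
    """Turn literal backslash escapes into the UTF-8 text they spell out.

    Hypothetical helper; assumes the escapes describe UTF-8 bytes.
    """
    # Step 1: interpret the escapes. Each \xNN becomes the code point U+00NN,
    # so the two UTF-8 bytes C2 A0 show up as two separate characters.
    unescaped = codecs.unicode_escape_decode(text)[0]
    # Step 2: Latin-1 maps code points 0-255 to the same byte values,
    # which recovers the raw bytes the escapes were describing.
    raw = unescaped.encode("latin-1")
    # Step 3: decode those bytes as the UTF-8 they really are.
    return raw.decode("utf-8")


x = 'hello this is nice\\r\\n\\xc2\\xa0 goodbye'
result = unescape_utf8(x)
print(repr(result))  # 'hello this is nice\r\n\xa0 goodbye'
```

If the input might not be valid UTF-8 after unescaping, the final `decode` raises `UnicodeDecodeError`; whether to catch that or pass `errors="replace"` depends on how strict you need to be.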