How to encode Unicode to bytes so that the original string can be retrieved in Python 3.11?


In Python 3.11 we can encode a string like this:

string.encode('ascii', 'backslashreplace')

This works neatly for, say: 'hellö' => b'hell\\xf6'

However, when the input is 'hellö w\\xf6rld' I get b'hell\\xf6 w\\xf6rld' (notice that the second word of the input contains a literal backslash sequence that only looks like a character escape).

Or in other words the following holds:

'hellö wörld'.encode('ascii', 'backslashreplace') == 'hellö w\\xf6rld'.encode('ascii', 'backslashreplace')

This means the encoding has lost data: two different inputs produce the same bytes, so the original string cannot be recovered.

Is there a way to make Python encode this reversibly, so that backslashes are themselves escaped? Or is there a library that does so?


Solution

  • Use the unicode_escape codec without an error handler instead of the ascii codec with the backslashreplace error handler. The error handler only runs for characters that ascii cannot encode, so a backslash already present in the input passes through unescaped, and that is what causes the loss. The unicode_escape result also contains only ASCII characters, but it escapes the backslashes as well:

    >>> 'hellö wörld'.encode('unicode_escape') == 'hell\\xf6 w\\xf6rld'.encode('unicode_escape')
    False
    >>> 'hellö wörld'.encode('unicode_escape')
    b'hell\\xf6 w\\xf6rld'
    >>> 'hell\\xf6 w\\xf6rld'.encode('unicode_escape')
    b'hell\\\\xf6 w\\\\xf6rld'
    

    If you don't have an ASCII requirement, just use .encode() (the default is UTF-8 in Python 3, which handles all Unicode), then .decode() to restore the original string.
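
    As a minimal sketch of the round trip described above, the following checks that both approaches restore the original string. The helper names to_ascii_bytes and from_ascii_bytes are hypothetical and exist only for illustration; the only library calls used are str.encode, bytes.decode and bytes.isascii.

```python
# Hypothetical helper names, introduced only for this illustration.
def to_ascii_bytes(text: str) -> bytes:
    # unicode_escape escapes non-ASCII characters and existing backslashes,
    # so distinct inputs yield distinct ASCII-only outputs.
    return text.encode('unicode_escape')


def from_ascii_bytes(data: bytes) -> str:
    # Decoding with the same codec reverses the escaping.
    return data.decode('unicode_escape')


samples = [
    'hellö wörld',      # real non-ASCII characters
    'hellö w\\xf6rld',  # literal backslash followed by "xf6"
    'tab\there',        # control character
    'snow \u2603',      # character outside Latin-1
]

for original in samples:
    encoded = to_ascii_bytes(original)
    assert encoded.isascii()                      # ASCII-only output
    assert from_ascii_bytes(encoded) == original  # lossless round trip
    # Without an ASCII requirement, plain UTF-8 also round-trips.
    assert original.encode().decode() == original
```

    Note that the two problem inputs from the question now encode to different bytes, which is what makes decoding unambiguous.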