In Python 3.11 we can encode a string like this:
string.encode('ascii', 'backslashreplace')
Which works neatly for say: hellö
=> hell\\xf6
However when I insert hellö w\\xf6rld
I get hell\\xf6 w\\xf6rld
(notice that the second input already contains a literal part that looks like a character escape sequence)
Or in other words the following holds:
'hellö wörld'.encode('ascii', 'backslashreplace') == 'hellö w\\xf6rld'.encode('ascii', 'backslashreplace')
This means the encoding has lost data: two different inputs produce the same output.
Is there a way to make Python encode this correctly, so that backslashes are themselves escaped? Or is there a library that does so?
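Use the unicode_escape codec with no error handler instead of the ascii codec with the backslashreplace error handler. The ascii codec raises an error on non-ASCII characters, and backslashreplace rewrites only those characters as escape sequences; a backslash already in the string is valid ASCII and passes through unchanged, which is where the data is lost. The unicode_escape codec also produces only ASCII characters, but it escapes backslashes as well: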
>>> 'hellö wörld'.encode('unicode_escape') == 'hell\\xf6 w\\xf6rld'.encode('unicode_escape')
False
>>> 'hellö wörld'.encode('unicode_escape')
b'hell\\xf6 w\\xf6rld'
>>> 'hell\\xf6 w\\xf6rld'.encode('unicode_escape')
b'hell\\\\xf6 w\\\\xf6rld'
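Because backslashes are escaped, the result can be decoded back to the original string with the same codec. A minimal sketch of the round trip follows; the helper names escape_to_ascii and unescape_from_ascii are made up for illustration and are not part of any library:

```python
def escape_to_ascii(text: str) -> bytes:
    # Hypothetical helper: ASCII-only bytes with backslashes escaped too.
    return text.encode('unicode_escape')


def unescape_from_ascii(data: bytes) -> str:
    # Hypothetical helper: inverse of escape_to_ascii.
    return data.decode('unicode_escape')


samples = ['hellö wörld', 'hellö w\\xf6rld', 'tab\there', '€ and 😀']
for sample in samples:
    encoded = escape_to_ascii(sample)
    assert encoded.isascii()                       # output is pure ASCII
    assert unescape_from_ascii(encoded) == sample  # nothing is lost
```

Note that decoding with unicode_escape is only meant for bytes produced this way; it is not a general-purpose decoder for arbitrary input.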
If you don't have an ASCII requirement, then just use .encode() (the default is UTF-8 in Python 3, which handles all of Unicode), then .decode() to restore the original string.
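As a small sketch of that UTF-8 round trip (the variable names are only illustrative):

```python
text = 'hellö w\\xf6rld'   # contains a real ö and a literal backslash sequence

data = text.encode()       # UTF-8 by default in Python 3
restored = data.decode()   # UTF-8 by default as well

assert restored == text
# The two strings from the question stay distinguishable:
assert 'hellö wörld'.encode() != text.encode()
```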