I'm trying to replace escaped Unicode characters with the actual characters:
string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))
The expected output is ä, but the actual output is Ã¤.
The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
Outputs:
'ä'
This works as follows:

- The first .encode('latin-1') converts the characters '\', 'u', '0', '0', 'c', etc. to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats the ASCII characters properly).
- .decode('unicode_escape') then interprets '\u00c3' as the escape for the Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code this is nonsense, but this code point has the right byte representation when it is encoded again with ISO-8859-1/'latin-1', so...
- ...the second .encode('latin-1') yields the bytes b'\xc3\xa4', which the final .decode('utf-8') correctly decodes as 'ä'.
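To make the intermediate states visible, here is the same chain split into named steps (the variable names are mine, for illustration only):

```python
s = "\\u00c3\\u00a4"  # the literal twelve characters \u00c3\u00a4

step1 = s.encode('latin-1')             # b'\\u00c3\\u00a4' - same bytes, ASCII-safe
step2 = step1.decode('unicode_escape')  # 'Ã¤' - escapes interpreted, still mojibake
step3 = step2.encode('latin-1')         # b'\xc3\xa4' - the original UTF-8 bytes
step4 = step3.decode('utf-8')           # 'ä' - decoded with the right codec

print(step4)  # ä
```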
Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
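For illustration, one plausible way such strings get produced in the first place is UTF-8 bytes being mis-decoded as Latin-1 and then JSON-escaped with the default ensure_ascii=True. This sketch is an assumption about what the upstream code might be doing, not a diagnosis of your pipeline:

```python
import json

correct = 'ä'

# Breaking it: read UTF-8 bytes with the wrong codec, then JSON-escape the result
mojibake = correct.encode('utf-8').decode('latin-1')  # 'Ã¤'
broken = json.dumps(mojibake)                         # '"\\u00c3\\u00a4"'

# Not breaking it: keep the codecs consistent and no repair is needed
fixed = json.loads(json.dumps(correct))
assert fixed == 'ä'
```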