Tags: python, python-3.x, unicode, encoding

How to decode escaped Unicode characters?


I'm trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, but the actual output is Ã¤.


Solution

  • The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

    ("\\u00c3\\u00a4"
      .encode('latin-1')
      .decode('unicode_escape')
      .encode('latin-1')
      .decode('utf-8')
    )
    

    Outputs:

    'ä'
    

    This works as follows:

    • The string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., is converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly).
    • Use a decoder that interprets the '\u00c3' escape as the Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'), and likewise '\u00a4' as U+00A4 (CURRENCY SIGN, '¤'). From the point of view of your code, that is nonsense, but these code points have the right byte representation when encoded again with ISO-8859-1/'latin-1', so...
    • Encode it again with 'latin-1', which yields the two bytes 0xC3 0xA4.
    • Decode it "properly" this time, as UTF-8: 0xC3 0xA4 is the UTF-8 encoding of 'ä'.
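
The four steps can also be written out one at a time, which makes the intermediate values easier to inspect. This is only a sketch of the same chain as above; the variable names (escaped, step1 to step4) are made up for illustration.

```python
# The input: 12 plain ASCII characters (backslash, u, 0, 0, c, 3, ...).
escaped = "\\u00c3\\u00a4"

# Step 1: turn the ASCII text into bytes (any ASCII-compatible codec works).
step1 = escaped.encode("latin-1")

# Step 2: interpret the \uXXXX escapes; the result is the two-character
# string U+00C3 U+00A4, which displays as mojibake.
step2 = step1.decode("unicode_escape")

# Step 3: map each code point below 256 back to a single byte.
step3 = step2.encode("latin-1")

# Step 4: read those bytes as what they really are, namely UTF-8.
step4 = step3.decode("utf-8")

for name, value in [("step1", step1), ("step2", step2),
                    ("step3", step3), ("step4", step4)]:
    print(name, ascii(value))
```

Printing with ascii() keeps the output readable even on a terminal that cannot display the intermediate characters. Note that step 3 raises UnicodeEncodeError if the escapes contain any code point above U+00FF, which would be a hint that the text is not broken in this particular way.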

    Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
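
For illustration only, here is one way such a string could come into existence. The question does not say how the data was produced, so the scenario below (UTF-8 bytes mistaken for Latin-1 text and then serialized with json.dumps) is purely an assumption; the point is just that fixing the wrong decode step makes the repair chain unnecessary.

```python
import json

original = "\u00e4"  # the character the author actually meant (a with diaeresis)

# Hypothetical bug: the UTF-8 bytes are decoded with the wrong codec.
mojibake = original.encode("utf-8").decode("latin-1")

# Serializing the already-broken text escapes the wrong code points.
broken = json.dumps(mojibake)

# Without the wrong decode step, the escape is the expected one.
healthy = json.dumps(original)

print(broken)   # the escapes from the question, wrapped in JSON quotes
print(healthy)  # a single escape for U+00E4
print(json.loads(healthy) == original)
```

If the producer looks anything like this, the fix belongs at the decode("latin-1") call, not in the consumer.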