Search code examples
pythonunicodeunicode-escapes

How to decode partially escaped unicode string in python (mixed unicode and escaped unicode)?


Given the following string:

str = "\\u20ac €"

How to decode it into € €?

Using str.encode("utf-8").decode("unicode-escape") returns € â\x82¬

(To clarify, I am looking for a general solution how to decode any mix of unicode and escaped characters)


Solution

  • A simple and fast solution is to use re.sub to match \u and exactly four hexadecimal digits, and convert those digits into a Unicode code point:

    import re
    
    s = r"blah bl\uah \u20ac € b\u20aclah\u12blah blah"
    print(s)
    
    s = re.sub(r'\\u([0-9a-fA-F]{4})',lambda m: chr(int(m.group(1),16)),s)
    print(s)
    

    Output:

    blah bl\uah \u20ac € b\u20aclah\u12blah blah
    blah bl\uah € € b€lah\u12blah blah