Search code examples
pythonutf-8escapingutfavahi

Converting escaped characters to utf in Python


Is there an elegant way to convert "test\207\128" into "testπ" in python?

My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy to parse format. However the problem is that it outputs non alpha-numeric characters as escaped sequences. So a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping a leading zero and using chr(35) to recover the #. This solution breaks on multi-byte utf characters such as "π" which gets output as "\207\128".


Solution

  • The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:

    import re
    value = r"test\207\128"
    # First off turn this into a byte array, since it's not a unicode string
    value = value.encode("utf-8")
    # Now replace any "\###" with a byte character based off
    # the decimal number captured
    value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
    # And now that we have a normal UTF-8 string, decode it back to a string
    value = value.decode("utf-8")
    print(value)
    # Outputs: testπ