python string translation

How to print out strings with unicode escape characters correctly


I am reading strings from a file that contain embedded Unicode escape sequences, \u00e9 for example. When I print a string literal using print(), the escape sequences are translated into the correct characters, but if I get the strings from stdin and print them out, print() doesn't convert the escape sequences into the Unicode characters.

For example, when I use:

print("Le Condamn\u00e9 \u00e0 mort")

Python correctly prints Le Condamné à mort. However, if I get the same string from stdin, I get: Le Condamn\u00e9 \u00e0 mort

Does anyone know how I can get python to translate the escape sequences to the correct unicode characters? Also, why does print behave differently when you give it a string literal rather than a string variable?


Solution

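  • In a string literal, Python interprets \u00e0 as an escape sequence when it compiles the source, so the resulting string holds the single character 'à'. When the same text is read from a file or stdin, no such interpretation happens: the string holds six separate characters (a backslash, 'u', '0', '0', 'e', '0'), which is what the literal '\\u00e0' would give you. print() behaves identically in both cases; the strings it receives are simply different. A solution is to find each '\\u00e0'-style sequence in the string and replace it with the character it stands for.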

    Here is some code that converts each literal '\\uXXXX' sequence in the string into the character it is supposed to be.

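    def special_char_fix(string):
        result = []
        pl = 0
        while pl < len(string):
            val = string[pl + 2:pl + 6]
            # A literal escape is a backslash, a 'u', then exactly four hex digits.
            if (string[pl:pl + 2] == '\\u' and len(val) == 4
                    and all(c in '0123456789abcdefABCDEF' for c in val)):
                result.append(chr(int(val, 16)))
                pl += 6  # skip the whole six-character escape
            else:
                result.append(string[pl])  # ordinary character, copy as-is
                pl += 1
        return ''.join(result)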
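
    The difference between the two strings, and an alternative way to do the replacement, can be sketched with a regular expression. This is only a sketch: the helper name decode_u_escapes is made up for illustration, and it assumes the input only ever contains four-digit \uXXXX escapes. A raw string stands in for text read from a file or stdin, since Python 3 leaves \u untouched in raw strings.

```python
import re

# Matches a literal backslash, a 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace each backslash-u escape in text with the character it names.

    Hypothetical helper, not part of any library.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# In a normal literal the escape is resolved at compile time: one character.
print(len("\u00e9"))    # 1
# In a raw string (like text from a file or stdin) it stays six characters.
print(len(r"\u00e9"))   # 6

from_input = r"Le Condamn\u00e9 \u00e0 mort"
print(decode_u_escapes(from_input))
```

    Anything that is not a complete \uXXXX sequence, such as a lone backslash, is left unchanged. Surrogate pairs (two escapes that together encode one character above U+FFFF) are not combined by this sketch.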