
Selective replacement of unicode characters in Python using regex


There are many answers explaining how to use regex to remove Unicode characters in Python.

See "Remove Unicode code (\uxxx) in string Python" and 'Python regex module "re" match unicode characters with \u'.

However, in my case, I don't want to remove every Unicode character, only the ones that are displayed as their \u escape code, not the ones that are shown properly as characters. I have tried both solutions, and they remove both kinds of Unicode characters.

For example, \u2002pandemic becomes pandemic (which is what I want), but master’s also becomes masters (which is not).

Is there a general solution for removing the first kind of Unicode character while keeping the second kind?
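
To make the problem concrete, here is a minimal sketch of the behavior described. It assumes the linked answers boil down to stripping everything outside the ASCII range; the exact pattern below is an assumption for illustration, not a quote from those answers.

```python
import re

# Sample string: an EN SPACE (shown as \u2002 in repr) and a curly apostrophe.
sample = "\u2002pandemic and master\u2019s"

# Assumed approach: drop every character outside the ASCII range.
ascii_only = re.sub(r"[^\x00-\x7f]", "", sample)

print(repr(sample))   # the EN SPACE shows as an escape, the apostrophe does not
print(ascii_only)     # both the EN SPACE and the apostrophe are gone
```

The apostrophe is lost because the pattern cannot tell a printable non-ASCII character from a non-printable one.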


Solution

  • This relies on the fact that the debug representation (repr()) of a string shows escape codes for non-printable characters. It removes those escape codes from the repr() output (three forms: \xnn, \unnnn, \Unnnnnnnn) and then evaluates the result back into a string with ast.literal_eval():

    import re
    import ast
    
    text = '\x19\x40\u2002\u2019\U0001e526\U0001f235\\u1234\\U00012345\\xff\\\u2002'
    #       ^^^^    ^^^^^^      ^^^^^^^^^^                                   ^^^^^^
    # Remove the marked escapes; the others are printable characters or literal backslashes.
    # In the repr(), an odd number of backslashes before x/u/U means it's an escape code.
    print('printed text:   ', text)
    print('repr() text:    ', repr(text))
    clean_text = ast.literal_eval(re.sub(r'''(?x)                # verbose mode
                                             (?<!\\)             # not preceded by literal backslash
                                             ((?:\\\\)*)         # zero or more pairs of literal backslashes (group 1)
                                             \\                  # match a literal backslash
                                             (?:                 # non-capturing group
                                             (?:x[0-9a-f]{2}) |  # match an x and 2 hexadecimal digits OR
                                             (?:u[0-9a-f]{4}) |  # match a u and 4 hex digits OR
                                             (?:U[0-9a-f]{8})    # match a U and 8 hex digits
                                             )                   # end non-capturing group
                                             ''',
                                             r'\1'               # replace with group 1 (pairs of backslashes, if any)
                                             , repr(text)))      # string to operate on
    print('cleaned text:   ', clean_text)
    print('cleaned repr(): ', repr(clean_text))
    

    Output:

    printed text:    @ ’𞔦🈵\u1234\U00012345\xff\ 
    repr() text:     '\x19@\u2002’\U0001e526🈵\\u1234\\U00012345\\xff\\\u2002'
    cleaned text:    @’🈵\u1234\U00012345\xff\
    cleaned repr():  '@’🈵\\u1234\\U00012345\\xff\\'
    

    Note that you may not want to remove every character that displays as an escape code. The difference between str() (print display) and repr() (debug display) can be useful. For example, \u2002 is an EN SPACE (another type of space character) and prints as a space. The debug representation shows it as an escape code only so that you can tell an ASCII SPACE from an EN SPACE.
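
As a possible alternative to the regex on repr(), here is a minimal sketch built on str.isprintable(), which the Python documentation describes as identifying the characters that repr() does not escape. This is not the method used above. The names strip_escaped, keep and space_for_separators are hypothetical. The sketch keeps \n, \r and \t (which repr() shows as short escapes that the regex above leaves alone) and can optionally turn Unicode space separators such as EN SPACE into plain spaces instead of deleting them.

```python
import unicodedata


def strip_escaped(text, keep="\n\r\t", space_for_separators=False):
    """Drop characters that repr() would show as \\x, \\u or \\U escapes.

    strip_escaped, keep and space_for_separators are hypothetical names
    used for illustration only.
    """
    out = []
    for ch in text:
        if ch.isprintable() or ch in keep:
            # Printable characters (and \n, \r, \t) are kept unchanged.
            out.append(ch)
        elif space_for_separators and unicodedata.category(ch) == "Zs":
            # Optionally turn space separators (e.g. EN SPACE) into ASCII spaces.
            out.append(" ")
        # Anything else is non-printable and is dropped.
    return "".join(out)


sample = "\u2002pandemic and master\u2019s\x19"
print(repr(strip_escaped(sample)))
print(repr(strip_escaped(sample, space_for_separators=True)))
```

With the default arguments the EN SPACE is removed and the curly apostrophe is kept; with space_for_separators=True the EN SPACE becomes an ordinary space.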