Removing control characters from a string in Python


I currently have the following code:

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

This just does not work if there is more than one character to be deleted.


Solution

  • Unicode defines many control and format characters beyond the ASCII ones below chr(32). If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode general category code (e.g., control character, space separator, letter) of any character. Every control character has the category "Cc", and all codes in the "C" ("Other") group start with "C": Cc (control), Cf (format), Cs (surrogate), Co (private use), and Cn (unassigned).

    This snippet removes every character whose category starts with "C" from a string. Note that this includes tabs, newlines, and carriage returns.

    import unicodedata

    def remove_control_characters(s):
        # Keep only characters whose category is outside the "C" (Other) group.
        return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
    

    Examples of unicode categories:

    >>> from unicodedata import category
    >>> category('\r')      # carriage return --> Cc : control character
    'Cc'
    >>> category('\0')      # null character ---> Cc : control character
    'Cc'
    >>> category('\t')      # tab --------------> Cc : control character
    'Cc'
    >>> category(' ')       # space ------------> Zs : separator, space
    'Zs'
    >>> category(u'\u200A') # hair space -------> Zs : separator, space
    'Zs'
    >>> category(u'\u200b') # zero width space -> Cf : other, format
    'Cf'
    >>> category('A')       # letter "A" -------> Lu : letter, uppercase
    'Lu'
    >>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
    'Lo'
    >>> category(',')       # comma  -----------> Po : punctuation, other
    'Po'
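
Because tab, carriage return, and newline are themselves in category Cc, filtering on the "C" prefix also flattens multi-line text. If that is not wanted, one possible variation is to keep an explicit set of characters. The sketch below is only an illustration: the function name remove_control_characters_keep_whitespace and the KEEP set are hypothetical, and which characters are worth keeping is an assumption that depends on your data.

```python
import unicodedata

# Assumption for illustration: these whitespace controls should survive.
KEEP = {"\t", "\n", "\r"}

def remove_control_characters_keep_whitespace(s, keep=KEEP):
    """Drop characters in any "C*" category unless they are listed in keep."""
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )

# NUL (Cc) and zero width space (Cf) are removed; tab and newline are kept.
sample = "a\x00b\tc\u200bd\ne"
print(repr(remove_control_characters_keep_whitespace(sample)))  # 'ab\tcd\ne'
```

Passing keep=set() gives the same result as the stricter snippet above.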