
Stripping non-printable characters from a string in Python


I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output, and curses.ascii.isprint returns False for any non-ASCII character.


Solution

  • Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

    import unicodedata, re, itertools, sys
    
    # sys.maxunicode is itself a valid code point, so the range needs + 1
    all_chars = (chr(i) for i in range(sys.maxunicode + 1))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
    
    control_char_re = re.compile('[%s]' % re.escape(control_chars))
    
    def remove_control_chars(s):
        return control_char_re.sub('', s)
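
A quick check of the Python 3 version, as a minimal sketch: the `sample` string is made up for illustration, and the pattern is built from the two hard-coded Cc ranges rather than the full scan. Note that tab, newline and carriage return are in category Cc too, so they are removed along with the other control characters, while accented and CJK characters are kept.

```python
import itertools
import re

# Cc code points: the C0 controls (0x00-0x1f), DEL (0x7f) and the C1 controls (0x80-0x9f).
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))


def remove_control_chars(s):
    return control_char_re.sub('', s)


# Hypothetical sample: NUL, BEL and a newline mixed into non-ASCII text.
sample = 'caf\u00e9\x00 \u65e5\u672c\u8a9e\x07\n!'
cleaned = remove_control_chars(sample)
print(repr(cleaned))
```

If line breaks or tabs should survive, one option is to drop them from `control_chars` before compiling the pattern.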
    

    For Python 2:

    import unicodedata, re, sys
    
    # sys.maxunicode is itself a valid code point, so the range needs + 1
    all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
    categories = {'Cc'}
    control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
    # or equivalently and much more efficiently
    control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))
    
    control_char_re = re.compile('[%s]' % re.escape(control_chars))
    
    def remove_control_chars(s):
        return control_char_re.sub('', s)
    

    For some use cases, additional categories (e.g. all of the categories in the control group, C*) might be preferable, although this can slow down processing and increase memory usage significantly. Number of characters per category (the Cf and Cn counts vary with the Unicode version):

    • Cc (control): 65
    • Cf (format): 161
    • Cs (surrogate): 2048
    • Co (private-use): 137468
    • Cn (unassigned): 836601
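
The counts above can be reproduced, and the memory cost of large categories reduced, with something like the following. This is a sketch under assumptions: `category_counts` and `build_class` are hypothetical helper names, not part of the original answer. Instead of listing every character, `build_class` collapses consecutive code points into `\UXXXXXXXX-\UXXXXXXXX` ranges, which keeps the pattern short even for Co or Cn.

```python
import re
import sys
import unicodedata
from collections import Counter


def category_counts(prefix='C'):
    """Count code points per general category whose name starts with `prefix`."""
    counts = Counter()
    for i in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(i))
        if cat.startswith(prefix):
            counts[cat] += 1
    return counts


def build_class(categories):
    """Build a regex character class for `categories` from code point ranges."""
    parts = []
    start = None
    # One step past sys.maxunicode so that a run ending at the last code point is flushed.
    for i in range(sys.maxunicode + 2):
        inside = i <= sys.maxunicode and unicodedata.category(chr(i)) in categories
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            end = i - 1
            if start == end:
                parts.append('\\U%08x' % start)
            else:
                parts.append('\\U%08x-\\U%08x' % (start, end))
            start = None
    return '[' + ''.join(parts) + ']'


# Strip control and format characters (e.g. U+200B ZERO WIDTH SPACE is Cf).
strip_re = re.compile(build_class({'Cc', 'Cf'}))
print(strip_re.sub('', 'a\x00b\u200bc'))
```

Building the pattern scans every code point once, so it is best done a single time at import and reused.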

    Edit: added suggestions from the comments.