python, unicode

Remove punctuation from Unicode formatted strings


I have a function that removes punctuation from a list of strings:

import re

def strip_punctuation(input):
    # Keeps only ASCII letters, digits, and spaces; every other character is deleted.
    for x, word in enumerate(input):
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", word)
    return input

I recently modified my script to use Unicode strings so that it can handle non-Western characters. The function breaks when it encounters these characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode strings?


Solution

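  • You could use the unicode.translate() method with a table that maps every code point in a Unicode punctuation category (category name starting with 'P') to None, which deletes it: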

    import unicodedata
    import sys
    
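    # Map every punctuation code point (category P*) to None so translate() deletes it.
    # The + 1 makes the range include sys.maxunicode itself.
    tbl = dict.fromkeys(i for i in xrange(sys.maxunicode + 1)
                          if unicodedata.category(unichr(i)).startswith('P'))
    def remove_punctuation(text):
        return text.translate(tbl)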
    

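    You could also use the \p{P} Unicode property class, which is supported by the third-party regex module (the standard re module does not support it). The ur"" prefix below is Python 2 syntax; in Python 3, use a plain r"" string: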

    import regex as re
    
    def remove_punctuation(text):
        return re.sub(ur"\p{P}+", "", text)
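
The snippets above are written for Python 2. In Python 3, where every str is already Unicode and xrange/unichr no longer exist, the translate() approach could be sketched as follows using only the standard library. The names PUNCTUATION_TABLE and strip_punctuation and the sample strings are illustrative assumptions, not part of the original answer.

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P'
# (Pc, Pd, Ps, Pe, Pi, Pf, Po) to None; str.translate() deletes such entries.
# PUNCTUATION_TABLE is a hypothetical name chosen for this sketch.
PUNCTUATION_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)


def remove_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(PUNCTUATION_TABLE)


def strip_punctuation(words):
    """Apply remove_punctuation to each string and return a new list.

    Unlike the function in the question, the input list is not modified.
    """
    return [remove_punctuation(word) for word in words]


# Sample input (made up for illustration).
words = ['привет, мир!', 'こんにちは。', "don't"]
print(strip_punctuation(words))
```

Because the table covers only the punctuation categories, characters in the symbol categories such as '+' (Sm) or '$' (Sc) are left in place. If those should go too, the startswith('P') test could be widened to startswith(('P', 'S')).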