Tags: python, regex, python-2.7, unicode, punctuation

Remove selected punctuation from unicode strings


I am working with a set of Unicode strings and using the following piece of code (taken from "Remove punctuation from Unicode formatted strings"):

import regex

def punc(text):
    return regex.sub(ur"\p{P}+", " ", text)

I want to go one step further and selectively keep certain punctuation marks. For example, the hyphen (-) should not be removed from the Unicode string. What would be the best way to do that? Thanks in advance! :)


Solution

  • You can negate the \p{P} with \P{P} then put it in a negated character class ([^…]) along with whatever characters you want to keep, like this:

    return regex.sub(ur"[^\P{P}-]+", " ", text)
    

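    This matches one or more characters that are in \p{P} but are not listed in the character class. The negated class excludes everything that is not punctuation (\P{P}) and also excludes -, so only punctuation other than - is left to match.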

    Remember that - is a special character within a character class, where it normally denotes a range. If it doesn't appear at the start or end of the character class, you'll probably need to escape it as \-.
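
As a minimal sketch of the character-class approach with more than one kept character, the example below keeps both the hyphen and the apostrophe. It assumes Python 3 (where the `ur` prefix is not valid syntax, so a plain raw string is used) and the third-party `regex` package; the function name `punc_keep` and the choice of kept characters are illustrative assumptions, not part of the original answer.

```python
import regex  # third-party package (pip install regex), not the stdlib `re`


def punc_keep(text):
    """Replace runs of Unicode punctuation with one space, keeping - and '."""
    # [^\P{P}\-'] = any character that is NOT (non-punctuation, hyphen,
    # or apostrophe), i.e. punctuation other than the two kept characters.
    # The hyphen is escaped because it no longer sits at the edge of the class.
    return regex.sub(r"[^\P{P}\-']+", " ", text)


ascii_result = punc_keep("Hello, world! It's a well-known fact...")
unicode_result = punc_keep("\u00abco-op\u00bb")  # guillemets around co-op
```

Each run of removed punctuation becomes a single space, so a removed mark that was already followed by a space leaves two spaces behind; collapse whitespace afterwards if that matters.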


    Another solution would be to use a negative lookahead ((?!…)) or a negative lookbehind ((?<!…)):

    return regex.sub(ur"((?!-)\p{P})+", " ", text)
    
    return regex.sub(ur"(\p{P}(?<!-))+", " ", text)
    

    But for something like this I'd recommend the character class instead.
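
A quick way to convince yourself that the three patterns behave the same is to run them on one input and compare the results. This is only a sketch under the same assumptions as before (Python 3 and the third-party `regex` package); the `PATTERNS` dictionary, the `strip_all` helper, and the sample string are made up for illustration.

```python
import regex  # third-party package (pip install regex)

# The three patterns from the answer, written as Python 3 raw strings.
PATTERNS = {
    "character class": r"[^\P{P}-]+",
    "lookahead": r"((?!-)\p{P})+",
    "lookbehind": r"(\p{P}(?<!-))+",
}


def strip_all(text):
    """Apply every pattern to `text` and return the results keyed by name."""
    return {name: regex.sub(pattern, " ", text)
            for name, pattern in PATTERNS.items()}


results = strip_all("well-known: yes, no; maybe?!")
all_agree = len(set(results.values())) == 1
```

If `all_agree` is true for your own test strings as well, the choice between the patterns comes down to readability, which is where the character class tends to win.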