python regex python-2.7 unicode punctuation

Remove selected punctuation from unicode strings

I am working with a set of unicode strings and using the following piece of code (as shown in Remove punctuation from Unicode formatted strings):

import regex

def punc(text):
    return regex.sub(ur"\p{P}+", " ", text)

I want to go one step further and selectively keep certain punctuation marks. For example, the hyphen (-) should not be removed from the unicode string. What would be the best way to do that? Thanks in advance! :)

Solution

You can negate \p{P} with \P{P}, then put that inside a negated character class ([^…]) along with whatever characters you want to keep, like this:

return regex.sub(ur"[^\P{P}-]+", " ", text)

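This matches one or more characters that are in \p{P} but are not listed in the character class. The double negative does the work: [^\P{P}-] matches any character that is neither a non-punctuation character nor a hyphen, which leaves exactly the punctuation other than -.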
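To make that logic concrete, here is a minimal sketch of the same filter written as a plain loop. It deliberately swaps the regex module for the standard library's unicodedata, on the assumption that the Unicode general categories starting with "P" are the set \p{P} refers to. The function name strip_punctuation_except and its keep parameter are hypothetical, made up for this illustration.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_punctuation_except(text, keep=u"-"):
    """Replace each run of punctuation with one space, sparing `keep`."""
    # Hypothetical helper, not part of the regex module. It follows the
    # same rule as the character class: a character is removed only if
    # it is punctuation AND is not one of the characters to keep.
    out = []
    in_run = False  # True while inside a run of removable punctuation
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        if is_punct and ch not in keep:
            if not in_run:
                out.append(u" ")  # one space per run, like the + quantifier
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return u"".join(out)
```

For example, strip_punctuation_except(u"well-known, isn't it?!") keeps the hyphen and returns u"well-known  isn t it " (the comma becomes a space next to the space that was already there). Passing keep=u"-'" would spare the apostrophe as well.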

Remember that - is a special character within a character class: between two characters it denotes a range. Unless it appears at the start or end of the class, as it does here, you'll need to escape it as \-.
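A quick way to see the difference is with the standard re module, which treats - inside a class the same way. This is only an illustrative sketch; the sample string and variable names are made up.

```python
import re

sample = u"a-bc"

# Between two characters, "-" defines a range: this class means a, b, or c.
as_range = re.findall(r"[a-c]", sample)

# Escaped, "-" is a literal hyphen: this class means a, hyphen, or c.
escaped = re.findall(r"[a\-c]", sample)

# Placed last, "-" is also literal, so no escape is needed.
trailing = re.findall(r"[ac-]", sample)
```

Here as_range picks up a, b and c but skips the hyphen, while escaped and trailing both pick up a, the hyphen and c.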

Another solution would be to use a negative lookahead ((?!…)) or a negative lookbehind ((?<!…)):

return regex.sub(ur"((?!-)\p{P})+", " ", text)

return regex.sub(ur"(\p{P}(?<!-))+", " ", text)

But for something like this I'd recommend the character class instead.