I am working with a set of unicode strings and using the following piece of code (as shown in Remove punctuation from Unicode formatted strings):
import regex
def punc(text):
return regex.sub(ur"\p{P}+", " ", text)
I wanted to go one step further and try to selectively keep certain punctuations. For example -
need not be removed from the unicode string. What would be the best way to do that? Thanks in advance! :)
You can negate the \p{P}
with \P{P}
then put it in a negated character class ([^…]
) along with whatever characters you want to keep, like this:
return regex.sub(ur"[^\P{P}-]+", " ", text)
This will match one or more of any character in \p{P}
except those that are also defined inside the character class.
Remember that -
is a special character within a character class. If it doesn't appear at the start or end of the character class, you'll probably need to escape it.
Another solution would be to use a negative lookahead ((?!…)
) or negative lookbehind ((?<!…)
)
return regex.sub(ur"((?!-)\p{P})+", " ", text)
return regex.sub(ur"(\p{P}(?<!-))+", " ", text)
But for something like this I'd recommend the character class instead.