Search code examples
pythonregexstringpython-regex

Python regex, remove all punctuation except hyphen for unicode string


I have this code for removing all punctuation from a regex string:

import regex as re    
re.sub(ur"\p{P}+", "", txt)

How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I'm wrong, P with anything after it is punctuation.


Solution

  • [^\P{P}-]+
    

    \P is the complementary of \p - not punctuation. So this matches anything that is not (not punctuation or a dash) - resulting in all punctuation except dashes.

    Example: http://www.rubular.com/r/JsdNM3nFJ3

    If you want a non-convoluted way, an alternative is \p{P}(?<!-): match all punctuation, and then check it wasn't a dash (using negative lookbehind).
    Working example: http://www.rubular.com/r/5G62iSYTdk