Tags: python, regex, python-3.x, punctuation

Python 3 Regex: remove all punctuation, except special word pattern


I have special tokens in my text of the form -ABC_ABC-, -ABC-, or -ABC_ABC_ABC-.

My regex pattern:

([\-]+[A-Z]+(?:[\_]?[A-Z])+[\-]+)

I want to remove all punctuation from a string, except where it is part of the pattern above. Can I use a regex substitution for a case like this?

Input string:

Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes! 

Expected output:

Lorem Ipsum simply dummy text -TOKEN_ABC- yes 

I have solved it with an if check, but it feels inefficient because I have to test every word:

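import re

TOKEN_RE = re.compile(r"-[A-Z]+(?:_[A-Z]+)*-")

def strip_punctuation(text):
    # text: iterable of token objects exposing a .text attribute
    sentence_list = []
    for word in text:
        if TOKEN_RE.fullmatch(word.text):
            # keep special tokens such as -TOKEN_ABC- unchanged
            sentence_list.append(word.text)
        else:
            # drop punctuation, including hyphens and underscores
            sentence_list.append(re.sub(r"[^\w\s]|[-_]", "", word.text))
    return " ".join(sentence_list)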

Solution

  • Use the third-party regex module instead of re; it supports the backtracking control verbs (*SKIP)(*FAIL), which re does not:

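    import regex  # third-party module: pip install regex

    text = 'Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes! '
    # Skip over -TOKEN- patterns; delete every other run of punctuation.
    res = regex.sub(r'-[A-Z]+(?:_[A-Z]+)*-(*SKIP)(*FAIL)|[^\w\s]+', '', text)
    print(res)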
    

    Output:

    Lorem Ipsum simply dummy text -TOKEN_ABC- yes
    

    Explanation:

        -               # a hyphen
        [A-Z]+          # 1 or more capitals
        (?:             # non capture group
          _             # underscore
          [A-Z]+        # 1 or more capitals
        )*              # end group, may appear 0 or more times
        -               # a hyphen
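        (*SKIP)         # on failure, resume the search after the text matched so far
        (*FAIL)         # force a failure, so the token itself is never replaced
      |                 # OR
        [^\w\s]+        # 1 or more characters that are neither word characters nor whitespace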
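
If installing the regex module is not an option, the same effect can probably be had with the standard re module by capturing the protected token in a group and putting it back in the replacement. The following is only a sketch under that assumption; the function name strip_punctuation_keep_tokens is made up for illustration and is not part of the original answer.

```python
import re

# Alternative 1 captures a protected token; alternative 2 matches other punctuation.
PATTERN = re.compile(r'(-[A-Z]+(?:_[A-Z]+)*-)|[^\w\s]+')

def strip_punctuation_keep_tokens(text):
    """Remove punctuation but keep tokens such as -TOKEN_ABC- intact."""
    # If group 1 matched, we hit a protected token: return it unchanged.
    # Otherwise group 1 is None and the punctuation run is replaced by ''.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

print(strip_punctuation_keep_tokens('Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes! '))
# Lorem Ipsum simply dummy text -TOKEN_ABC- yes
```

As in the regex version, the underscore counts as a word character, so underscores outside a token are left alone; add _ to the second alternative if those should be removed too.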