I have text tokens of the form -ABC-, -ABC_ABC-, or -ABC_ABC_ABC-.
My regex pattern:
([\-]+[A-Z]+(?:[\_]?[A-Z])+[\-]+)
I want to remove all punctuation from a string, except where it is part of a token matching the pattern above. Can I use regex substitution for a case like this?
Input string:
Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes!
Expected output:
Lorem Ipsum simply dummy text -TOKEN_ABC- yes
I have it working with an if check, but it feels inefficient because I have to test every word.
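import re

# The function name is illustrative; the original snippet is a function body.
# `text` is an iterable of token objects that expose a `.text` attribute.
def clean_sentence(text):
    sentence_list = []
    for word in text:
        if re.match(r"([-][A-Z]+(?:[_]?[A-Z]*[-]))", word.text):
            # Looks like a protected token: keep it unchanged.
            sentence_list.append(word.text)
        else:
            # Otherwise strip punctuation, including hyphens and underscores.
            text2 = re.sub(r"([^\w\s]|[\-_])", r"", word.text)
            sentence_list.append(text2)
    return " ".join(sentence_list)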
Using the regex module instead of re, with the verbs (*SKIP)(*FAIL):
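import regex  # third-party module (pip install regex), not the stdlib re

text = 'Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes! '
# Tokens are matched first and skipped; any other punctuation is removed.
res = regex.sub(r'-[A-Z]+(?:_[A-Z]+)*-(*SKIP)(*FAIL)|[^\w\s]+', '', text)
print(res)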
Output:
Lorem Ipsum simply dummy text -TOKEN_ABC- yes
Explanation:
- # a hyphen
[A-Z]+ # 1 or more capitals
(?: # non capture group
_ # underscore
[A-Z]+ # 1 or more capitals
)* # end group, may appear 0 or more times
- # a hyphen
(*SKIP) # on failure, do not retry inside the text matched so far
(*FAIL) # force a failure, so the token is skipped and left untouched
| # OR
[^\w\s]+ # 1 or more characters that are neither word characters nor whitespace
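
If installing the regex module is not an option, the same idea can be approximated with the standard re module: capture the token in a group and let a replacement function put it back, deleting everything else that matches. This is only a sketch under a few assumptions. The helper name strip_punctuation is made up for illustration, and the second alternative also removes stray underscores (as the code in the question does), because \w includes the underscore and [^\w\s] alone would leave them in place.

```python
import re

# Alternative 1 (captured): a protected token such as -ABC- or -ABC_DEF-.
# Alternative 2 (not captured): a run of punctuation, including underscores.
PATTERN = re.compile(r"(-[A-Z]+(?:_[A-Z]+)*-)|(?:[^\w\s]|_)+")


def strip_punctuation(text):  # hypothetical helper name
    # If group 1 took part in the match it is a token, so return it unchanged;
    # otherwise group 1 is None and the match is replaced by an empty string.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


text = "Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes! "
print(strip_punctuation(text))
```

Because the token alternative comes first, it wins at any position where both alternatives could start, which plays the same role as (*SKIP)(*FAIL) in the answer above.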