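How can I remove acronyms in parentheses, such as '(ABC)' or '(ABC-2A)', including ones with a missing closing parenthesis like '(ABC'?
But NOT the words between parentheses that start with an uppercase letter followed by a lowercase letter, e.g. '(Bobby)' or '(Bob went to the beach..)'. This is the part I am struggling with.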
text = ['(ABC went to the beach', 'The girl (ABC-2A) is walking', 'The dog (Bobby) is being walked', 'They are there (ABC)' ]
for string in text:
    cleaned_acronyms = re.sub(r'\([A-Z]*\)?', '', string)
    print(cleaned_acronyms)
#current output:
>> 'went to the beach' #Correct
>> 'The girl -2A) is walking' #Not correct
>> 'The dog obby) is being walked' #Not correct
>> 'They are there' #Correct
#desired & correct output:
>> 'went to the beach'
>> 'The girl is walking'
>> 'The dog (Bobby) is being walked' #(Bobby) is NOT an acronym (uppercase+lowercase)
>> 'They are there'
Have a try with a negative lookahead:
\((?![A-Z][a-z])[A-Z\d-]+\)?\s*
\( - A literal opening parenthesis.
(?![A-Z][a-z]) - Negative lookahead asserting that the position is not followed by an uppercase letter and then a lowercase letter.
[A-Z\d-]+ - Match 1+ uppercase letters, digits or hyphens.
\)? - An optional literal closing parenthesis.
\s* - 0+ whitespace characters.
Some sample Python script:
import re
text = ['(ABC went to the beach', 'The girl (ABC-2A) is walking', 'The dog (Bobby) is being walked', 'They are there (ABC)' ]
for string in text:
    cleaned_acronyms = re.sub(r'\((?![A-Z][a-z])[A-Z\d-]+\)?\s*', '', string)
    print(cleaned_acronyms)
Prints:
went to the beach
The girl is walking
The dog (Bobby) is being walked
They are there
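If it helps, the same pattern could also be written with re.VERBOSE so that each token carries its own comment. This is only a sketch: the names ACRONYM_RE and strip_acronyms are made up for illustration, and the final .strip() is an extra assumption on my part, added to drop the trailing space that is left behind when the acronym sits at the end of the string.

```python
import re

# Same pattern as in the answer, split across lines with re.VERBOSE.
# ACRONYM_RE and strip_acronyms are hypothetical names, not from the question.
ACRONYM_RE = re.compile(
    r"""
    \(                # literal opening parenthesis
    (?![A-Z][a-z])    # negative lookahead: not uppercase followed by lowercase
    [A-Z\d-]+         # 1+ uppercase letters, digits or hyphens
    \)?               # optional literal closing parenthesis
    \s*               # 0+ whitespace characters
    """,
    re.VERBOSE,
)


def strip_acronyms(s):
    """Remove parenthesised acronyms, then trim leftover outer whitespace."""
    # .strip() is an addition to the original one-liner (assumption).
    return ACRONYM_RE.sub("", s).strip()


text = ['(ABC went to the beach', 'The girl (ABC-2A) is walking',
        'The dog (Bobby) is being walked', 'They are there (ABC)']
for string in text:
    print(strip_acronyms(string))
```

Compiling the pattern once also avoids passing the raw regex string to re.sub on every loop iteration.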