Tags: python, regex, string, uppercase, python-re

Removing acronyms using regex, based on uppercase characters following a parenthesis


How can I remove the following:

  • Acronyms that start with an opening parenthesis followed by uppercase letters or digits, e.g. '(ABC', '(ABC)', '(ABC-2A)' or '(ABC-1)'.

But NOT words in parentheses that start with an uppercase letter followed by lowercase letters, e.g. '(Bobby)' or '(Bob went to the beach..)'. This is the part I am struggling with.


import re

text = ['(ABC went to the beach', 'The girl (ABC-2A) is walking', 'The dog (Bobby) is being walked', 'They are there (ABC)']
for string in text:
  cleaned_acronyms = re.sub(r'\([A-Z]*\)?', '', string)
  print(cleaned_acronyms)

# Current output:
>> 'went to the beach' # Correct
>> 'The girl -2A) is walking' # Not correct
>> 'The dog obby) is being walked' # Not correct
>> 'They are there' # Correct


# Desired output:
>> 'went to the beach'
>> 'The girl is walking'
>> 'The dog (Bobby) is being walked' # (Bobby) is NOT an acronym (uppercase + lowercase)
>> 'They are there'

Solution

  • Try a negative lookahead:

    \((?![A-Z][a-z])[A-Z\d-]+\)?\s*
    

    See an online demo

    • \( - A literal opening parenthesis.
    • (?![A-Z][a-z]) - Negative lookahead asserting that the position is not followed by an uppercase letter and then a lowercase letter.
    • [A-Z\d-]+ - Match 1+ uppercase letters, digits or hyphens.
    • \)? - An optional literal closing parenthesis.
    • \s* - 0+ whitespace characters.
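
To see what the lookahead contributes, here is a minimal sketch that runs the pattern with and without it on the '(Bobby)' example. The names WITH_LOOKAHEAD and WITHOUT_LOOKAHEAD are illustrative only and are not part of the answer's pattern.

```python
import re

# Illustrative names; only the first pattern is the one proposed above.
WITH_LOOKAHEAD = re.compile(r'\((?![A-Z][a-z])[A-Z\d-]+\)?\s*')
WITHOUT_LOOKAHEAD = re.compile(r'\([A-Z\d-]+\)?\s*')

sample = 'The dog (Bobby) is being walked'

# Without the lookahead, '(B' is matched and removed, mangling the name.
without_result = WITHOUT_LOOKAHEAD.sub('', sample)

# With the lookahead, the match is rejected right after '(' because
# 'Bo' is an uppercase letter followed by a lowercase one.
with_result = WITH_LOOKAHEAD.sub('', sample)

print(without_result)  # The dog obby) is being walked
print(with_result)     # The dog (Bobby) is being walked
```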

    A sample Python script:

    import re
    text = ['(ABC went to the beach', 'The girl (ABC-2A) is walking', 'The dog (Bobby) is being walked', 'They are there (ABC)' ]
    for string in text:
      cleaned_acronyms = re.sub(r'\((?![A-Z][a-z])[A-Z\d-]+\)?\s*', '', string)
      print(cleaned_acronyms)
    

    Prints:

    went to the beach
    The girl is walking
    The dog (Bobby) is being walked
    They are there
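
If the pattern is reused, one possible way to package it is a small helper with the regex written in verbose mode. This is only a sketch: the names ACRONYM and remove_acronyms are hypothetical, and the final .strip() is an added assumption to drop the trailing space that is left behind when the acronym ends the string.

```python
import re

# Hypothetical helper; same pattern as above, spelled out with re.VERBOSE.
ACRONYM = re.compile(r"""
    \(                # literal opening parenthesis
    (?![A-Z][a-z])    # reject uppercase followed by lowercase, as in (Bobby)
    [A-Z\d-]+         # one or more uppercase letters, digits or hyphens
    \)?               # optional closing parenthesis
    \s*               # any whitespace after the acronym
""", re.VERBOSE)


def remove_acronyms(s):
    """Remove parenthesised acronyms and trim leftover outer whitespace."""
    return ACRONYM.sub('', s).strip()


text = ['(ABC went to the beach', 'The girl (ABC-2A) is walking',
        'The dog (Bobby) is being walked', 'They are there (ABC)']
cleaned = [remove_acronyms(s) for s in text]
for line in cleaned:
    print(line)
```

Note that the lookahead only inspects the first two characters after the parenthesis, so a mixed-case token such as '(ABc)' would still be partially removed, leaving 'c)'.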