python regex abbreviation

Find an abbreviation right after a specific word using a regular expression


My goal is to find the parenthesized abbreviation that appears right after @PROG$ and replace it with @PROG$ (e.g. (ALI) -> @PROG$).

Input

s = "Background (UNASSIGNED): Previous study of ours showed that @PROG$ (ALI) and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients."

Output

"Background (UNASSIGNED): Previous study of ours showed that @PROG$ @PROG$ and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients."

I tried something like re.findall('(\(.*?\))', s), which gave me all the abbreviations, not just the one after @PROG$. How can I proceed from here? What do I need to fix?


Solution

  • You can use re.sub with a capturing group, like this:

    import re
    s = "Background (UNASSIGNED): Previous study of ours showed that @PROG$ (ALI) and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients."
    print(re.sub(r'(@PROG\$\s+)\([A-Z]+\)', r'\1@PROG$', s))
    # => Background (UNASSIGNED): Previous study of ours showed that @PROG$ @PROG$ and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients.
    

    The regex is

    (@PROG\$\s+)\([A-Z]+\)
    

    Details:

    • (@PROG\$\s+) - Group 1 (\1 in the replacement pattern refers to this group's value): the literal text @PROG$ followed by one or more whitespace characters
    • \( - a ( char
    • [A-Z]+ - one or more uppercase ASCII letters (replace with [^()]* to match anything between the parentheses except ( and ))
    • \) - a ) char.
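
The [^()]* variant mentioned above can be sketched as a small reusable function. This is only an illustration: the function name replace_following_parenthetical and its marker parameter are hypothetical, and the shortened sample sentence (with the made-up abbreviation "ALI-2") is there only to show a case that [A-Z]+ would miss.

```python
import re

def replace_following_parenthetical(text, marker="@PROG$"):
    """Replace a parenthesized group that directly follows `marker` with `marker`.

    Hypothetical helper for illustration; "@PROG$" is the placeholder
    used in the question.
    """
    # re.escape protects the "$" in the marker. [^()]* accepts any content
    # between the parentheses as long as it contains no nested parentheses.
    pattern = r'(' + re.escape(marker) + r'\s+)\([^()]*\)'
    # A function replacement inserts the marker literally, so it is never
    # interpreted as a replacement template.
    return re.sub(pattern, lambda m: m.group(1) + marker, text)

s = "showed that @PROG$ (ALI-2) and C-reactive protein (CRP) are factors"
print(replace_following_parenthetical(s))
```

Only the group right after @PROG$ is rewritten; (CRP) is left alone because it is not preceded by the marker. If the abbreviations are known to be plain uppercase letters, the stricter [A-Z]+ from the answer is the safer choice, since [^()]* will also replace longer parenthetical remarks.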