Tags: python, regex, string, regex-group, python-re

Regex string pattern insert operation


Input String:

The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, -8, -97, -2ALPHA, and -3 was specific antagonist. The LL-37, -15, SAC-1, -7, and -21 in keratinocytes was good.

Current Output String:

The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, interskin (IS)-8, interskin (IS)-97, interskin (IS)-2ALPHA, and -3 was specific antagonist. The LL-37, LL-15, SAC-1, SAC-7 and SAC-21 in keratinocytes was good.

Expected Output String:

The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, interskin (IS)-8, interskin (IS)-97, interskin (IS)-2ALPHA and interskin (IS)-3 was specific antagonist. The LL-37, LL-15, SAC-1, SAC-7 and SAC-21 in keratinocytes was good.

My output is missing the interskin (IS)-3 part: the trailing ", and -3" in the first sentence is left unexpanded. Here is my code. How can I change the pattern so that this last item is expanded too?

import re
string_a = "The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, -8, -97, -2ALPHA, and -3 was specific antagonist. The LL-37, -15, SAC-1, -7, and -21 in keratinocytes was good."
print(string_a)
pattern = re.compile(r"\b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)(\s*-\d+[A-Z]+(?:,*\s*-\d+)*|\s*-\d+(?:,*\s*-\d+)*)(?:,*\s*and\s+(-\d+))?")
print('\n')
print(pattern.sub(lambda x: x.group(1) + f', {x.group(1)}'.join(map(str.strip, x.group(2).strip().split(','))) + (f' and {x.group(1)}{x.group(3)}' if x.group(3) else ''), string_a))


Solution

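  • Your pattern and code are nearly there. In capture group 2, the repeated part of the first alternative, (?:,*\s*-\d+)*, matches digits only, so the match stops at -2 and leaves "ALPHA, and -3" behind; the optional "and" part is never reached. Allow optional uppercase characters by adding [A-Z]* at the end of that repeated group: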

    \b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)(\s*-\d+[A-Z]+(?:,*\s*-\d+[A-Z]*)*|\s*-\d+(?:,*\s*-\d+)*)(?:,*\s*and\s+(-\d+))?
                                                               ^^^^^^
    


    Example

    import re
    string_a = "The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, -8, -97, -2ALPHA, and -3 was specific antagonist. The LL-37, -15, SAC-1, -7, and -21 in keratinocytes was good."
    # [A-Z]* in the repeated group lets items such as -2ALPHA match, so the optional "and -3" part is reached
    pattern = re.compile(r"\b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)(\s*-\d+[A-Z]+(?:,*\s*-\d+[A-Z]*)*|\s*-\d+(?:,*\s*-\d+)*)(?:,*\s*and\s+(-\d+))?")
    # Prefix every comma-separated item in group 2 with group 1; group 3 is the optional item after "and"
    print(pattern.sub(lambda x: x.group(1) + f', {x.group(1)}'.join(map(str.strip, x.group(2).strip().split(','))) + (f' and {x.group(1)}{x.group(3)}' if x.group(3) else ''), string_a))
    

    Output

    The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, interskin (IS)-8, interskin (IS)-97, interskin (IS)-2ALPHA and interskin (IS)-3 was specific antagonist. The LL-37, LL-15, SAC-1, SAC-7 and SAC-21 in keratinocytes was good.
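
To see where the two patterns differ, it can help to compare the first match of each and to move the replacement logic out of the lambda into a named function. The following is only a sketch of that idea, not part of the original answer: the names TEXT, PREFIX, TAIL, OLD, NEW and expand are made up for illustration, and the patterns are the ones shown above, split into pieces for readability.

```python
import re

TEXT = (
    "The Proteinase-Activated Receptor-2 interskin (IS)-1ALPHA, -8, -97, "
    "-2ALPHA, and -3 was specific antagonist. "
    "The LL-37, -15, SAC-1, -7, and -21 in keratinocytes was good."
)

# Shared pieces of both patterns (hypothetical names, same regex text as above).
PREFIX = r"\b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)"   # group 1: the name to repeat
TAIL = r"(?:,*\s*and\s+(-\d+))?"                 # group 3: optional item after "and"

# Pattern from the question: items after the first may not carry letters.
OLD = re.compile(PREFIX + r"(\s*-\d+[A-Z]+(?:,*\s*-\d+)*|\s*-\d+(?:,*\s*-\d+)*)" + TAIL)
# Pattern from the answer: [A-Z]* added inside the repeated group.
NEW = re.compile(PREFIX + r"(\s*-\d+[A-Z]+(?:,*\s*-\d+[A-Z]*)*|\s*-\d+(?:,*\s*-\d+)*)" + TAIL)


def expand(match):
    """Repeat the prefix (group 1) in front of every list item."""
    prefix = match.group(1)
    items = [item.strip() for item in match.group(2).split(",")]
    result = ", ".join(prefix + item for item in items)
    if match.group(3):  # optional final item introduced by "and"
        result += f" and {prefix}{match.group(3)}"
    return result


old_first = OLD.search(TEXT).group(0)
new_first = NEW.search(TEXT).group(0)
print(old_first)  # the old match ends at "-2", before "ALPHA, and -3"
print(new_first)  # the new match runs through "and -3"

fixed = NEW.sub(expand, TEXT)
print(fixed)
```

Passing a named function to sub behaves the same as the one-line lambda, but each step (splitting group 2, prefixing the items, appending the "and" item) can be read and tested on its own.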