python regex string replace python-re

Replace a value with the previous occurrence of an acronym using the Python regular expression module


I need to prepend the preceding word (the acronym) to each bare -number that follows it in the sentence. Please see the input string and expected output string below for clarification. I have tried the .replace and re.sub methods, but only with static, hard-coded patterns, so the output is effectively produced by hand.

Input String:

The acnes stimulated the mRNA expression of interleukin (IL)-1, -8, LL-37, MMP-1, -2, -3, -9, and -13 in keratinocytes.

Expected Output String:

The acnes stimulated the mRNA expression of interleukin (IL)-1, interleukin (IL)-8, LL-37, MMP-1, MMP-2, MMP-3, MMP-9, and MMP-13 in keratinocytes.

Code:

import re
string_a = "The acnes stimulated the mRNA expression of interleukin (IL)-1, -8, LL-37, MMP-1, -2, -3, -9, and -13 in keratinocytes."
regex1 = re.findall(r"[a-z]+\s+\(+[A-Z]+\)+-\d+\,\s+-\d\,+", string_a)
regex2 = re.findall(r"[A-Z]+-\d+\,\s+-\d\,\s+-\d\,\s+-\d\,\s+[a-z]+\s+-\d+", string_a)

Solution

  • You can use

    import re
    string_a = "The acnes stimulated the mRNA expression of interleukin (IL)-1, -8, LL-37, MMP-1, -2, -3, -9, and -13 in keratinocytes."
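    # Group 1: the prefix; Group 2: the run of "-N, -N, ..."; Group 3: the number after ", and"
    pattern = re.compile(r"\b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)(\s*-\d+(?:,\s*-\d+)*)(?:,\s*and\s+(-\d+))?")
    # The lambda re-attaches the Group 1 prefix to every number in Group 2 and, if present, to Group 3
    print(pattern.sub(lambda x: x.group(1) + f', {x.group(1)}'.join(map(str.strip, x.group(2).strip().split(','))) + (f', and {x.group(1)}{x.group(3)}' if x.group(3) else ''), string_a))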
    # => The acnes stimulated the mRNA expression of interleukin (IL)-1, interleukin (IL)-8, LL-37, MMP-1, MMP-2, MMP-3, MMP-9, and MMP-13 in keratinocytes.
    


    Details

    • \b - word boundary
    • ([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+) - Capturing group 1: one or more ASCII letters, then zero or more whitespaces, (, one or more uppercase ASCII letters, and a ), OR one or more uppercase ASCII letters
    • (\s*-\d+(?:,\s*-\d+)*) - Capturing group 2: zero or more whitespaces, -, one or more digits, and then zero or more sequences of a comma, zero or more whitespaces, - and one or more digits
    • (?:,\s*and\s+(-\d+))? - an optional non-capturing group: a comma, zero or more whitespaces, the literal word and, one or more whitespaces, then Capturing group 3: -, one or more digits.
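
To make the group breakdown above concrete, here is a small sketch that lists what each capturing group holds for every match in the question's input string. The pattern and the input are copied from the answer; the variable name `captured` is only illustrative.

```python
import re

# Same pattern as in the answer above.
pattern = re.compile(
    r"\b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)(\s*-\d+(?:,\s*-\d+)*)(?:,\s*and\s+(-\d+))?"
)

string_a = (
    "The acnes stimulated the mRNA expression of interleukin (IL)-1, -8, "
    "LL-37, MMP-1, -2, -3, -9, and -13 in keratinocytes."
)

# Collect (Group 1, Group 2, Group 3) for every match, left to right.
captured = [m.groups() for m in pattern.finditer(string_a)]
for groups in captured:
    print(groups)
# ('interleukin (IL)', '-1, -8', None)
# ('LL', '-37', None)
# ('MMP', '-1, -2, -3, -9', '-13')
```

Group 3 is None whenever the list does not end in ", and -N", which is why the replacement checks it before appending anything.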

    The Group 1 value is prepended to all Group 2 comma-separated numbers inside a lambda used as a replacement argument.

    If Group 3 matched, the text ", and " followed by the concatenated Group 1 and Group 3 values is appended.
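
The replacement logic described above can also be written as a named function instead of a one-line lambda, which may be easier to read and test. This is a sketch that should behave the same as the answer's lambda for the question's input; the names `expand_match` and `expand_acronyms` are hypothetical and not part of the original answer.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(
    r"\b([A-Za-z]+\s*\([A-Z]+\)|[A-Z]+)(\s*-\d+(?:,\s*-\d+)*)(?:,\s*and\s+(-\d+))?"
)


def expand_match(match):
    """Build the replacement text for one match (hypothetical helper)."""
    prefix = match.group(1)
    # Split Group 2 ("-1, -2, -3") into individual "-N" items.
    numbers = [part.strip() for part in match.group(2).split(",")]
    # Prepend the prefix to every item and join them back with ", ".
    result = ", ".join(prefix + number for number in numbers)
    # Group 3 is only set when the list ends in ", and -N".
    if match.group(3):
        result += f", and {prefix}{match.group(3)}"
    return result


def expand_acronyms(text):
    """Apply the expansion to every match in text (hypothetical helper)."""
    return PATTERN.sub(expand_match, text)


string_a = (
    "The acnes stimulated the mRNA expression of interleukin (IL)-1, -8, "
    "LL-37, MMP-1, -2, -3, -9, and -13 in keratinocytes."
)
print(expand_acronyms(string_a))
```

Note that the pattern expects a comma before "and", so a list written without it (for example "MMP-1, -2 and -3") would leave the last "-3" unexpanded.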