Tags: python, regex, string, regex-group, python-re

Concatenate the term using substitute method via regex


Summary of problem: I have written a generic regex to capture two groups from a sentence. I then need to append the third term of the second group to the first group. The regex uses the word "and" as the delimiter that separates the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'

The regex I have tried:

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

The regex captures the groups, but the line that calls the substitute method raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with this problem would be appreciated.


Solution
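
On the error itself: in Python's re module, an optional group that does not take part in a match is reported as None, so x.group(2)[2] fails for every match that has no "and" part. Note also that [2] on a string picks the third character, not the third word. The sketch below uses a simplified stand-in pattern and a hypothetical helper named safe_repl (neither is from the question) to show one way to guard the callback:

```python
import re

# Simplified stand-in for the question's pattern (an assumption, not the
# original regex): group 2, the "and ..." part, is optional.
pattern = re.compile(r"(\w+)\s+(and\s+\w+\s+\w+)?")

def safe_repl(match):
    # group(2) is None whenever the optional "and ..." part did not match
    if match.group(2) is None:
        return match.group(0)  # leave this match unchanged
    third_word = match.group(2).split()[2]  # third word, not third character
    return f"{match.group(1)} {third_word} {match.group(2)}"

no_and = pattern.search("the genetic cells")
print(no_and.group(2))  # None, so indexing it would raise TypeError
print(pattern.sub(safe_repl, "human face and animals skin"))
```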

  • Split solution

    While this is not a regex solution, it works for the example sentence:

    from string import punctuation

    x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
    # keep the original string in x and work on a separate list of words
    words = x.split()
    for idx, word in enumerate(words):
        if word == "and":
            # strip punctuation or we will get skin. instead of skin
            words[idx] = words[idx + 2].strip(punctuation) + " and"
    print(' '.join(words))
    

    Output is:

    Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

    This solution avoids inserting into the list directly as that would cause problems with indices as you iterate through. Instead, we replace the first "and" in the list with "synthesis and", and the second "and" with "skin and", and then rejoin the split string.
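
The loop above assumes that every "and" is followed by at least two more words; otherwise the idx + 2 lookup raises an IndexError. A minimal sketch of the same idea wrapped in a function with a bounds check (the function name copy_shared_word is made up for illustration):

```python
from string import punctuation

def copy_shared_word(sentence):
    """Copy the second word after each "and" to just before that "and"."""
    words = sentence.split()
    for idx, word in enumerate(words):
        # Only rewrite an "and" that has two words after it
        if word == "and" and idx + 2 < len(words):
            words[idx] = words[idx + 2].strip(punctuation) + " and"
    return " ".join(words)

print(copy_shared_word("human face and animals skin."))
print(copy_shared_word("salt and pepper"))  # too short, left unchanged
```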

    Regex solution

    If you insist upon a regex solution, I suggest using re.findall with a pattern containing a single "and", as this generalizes better for the problem:

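    from string import punctuation
    import re

    x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
    # raw string so that \s reaches the regex engine unchanged
    pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
    result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
    print(result)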
    

    Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

    Once again we use strip(punctuation) because skin. is captured: we don't want to lose the punctuation at the end of the sentence, but we do want to lose it inside the sentence.

    Here is our pattern:

    (.*?)\sand\s(.*?)\s([^\s]+)
    
    1. (.*?)\s: lazily capture all content before the "and"; the \s matches the space in front of "and" but sits outside the group, so the space itself is not captured
    2. \s(.*?)\s: capture the word immediately following the "and"
    3. ([^\s]+): capture one or more non-whitespace characters (i.e. the second word after the "and"). This ensures we capture any punctuation attached to that word as well.
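
One caveat: joining the re.findall results drops any text that follows the last match. A possible variant, sketched here with the same pattern, uses re.sub with a callback so that unmatched text is kept as it is. The names distribute_shared_word and repl are made up for this example. Because none of the three groups is optional, none of them can be None inside the callback.

```python
import re
from string import punctuation

# Same pattern as above, written as a raw string
PATTERN = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

def distribute_shared_word(sentence):
    """Copy the second word after each "and" to just before that "and"."""
    def repl(match):
        before, first_word, shared = match.groups()
        # Strip punctuation from the copy only; the original word keeps it
        return f"{before} {shared.strip(punctuation)} and {first_word} {shared}"
    # re.sub leaves everything outside the matches untouched
    return PATTERN.sub(repl, sentence)

print(distribute_shared_word("cats and dogs bark loudly at night"))
```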