Summary of problem: I have written a generic regex to capture two groups from a sentence. I then need to append the 3rd term of the 2nd group to the 1st group. I use the word "and" in the regex as a partition to separate the two groups of the sentence. For example:
Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'
The regex I have tried:
import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin."
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))
The regex captures the groups, but the substitution line raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with the above problem would be appreciated.
While this is not a regex solution, this certainly works:
from string import punctuation
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))
Output is:
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
This solution avoids inserting into the list directly as that would cause problems with indices as you iterate through. Instead, we replace the first "and" in the list with "synthesis and", and the second "and" with "skin and", and then rejoin the split string.
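One caveat: if an "and" is the last or second-to-last word of the sentence, x[idx + 2] raises an IndexError. Below is a minimal sketch of a guarded variant, wrapped in a function so the original string is left untouched. The name share_trailing_word is my own choice for illustration, and leaving such an "and" unchanged is an assumption about the desired behaviour.

```python
from string import punctuation


def share_trailing_word(sentence):
    """Copy the second word after each "and" in front of that "and"."""
    words = sentence.split()
    for idx, word in enumerate(words):
        # Assumption: an "and" without two following words is left as-is.
        if word == "and" and idx + 2 < len(words):
            # Replace in place rather than insert, so indices stay valid.
            words[idx] = words[idx + 2].strip(punctuation) + " and"
    return " ".join(words)


sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")
print(share_trailing_word(sentence))
```

Note that split() followed by ' '.join() collapses any run of whitespace into a single space, which is fine for this input but may matter for others.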
If you insist upon a regex solution, I suggest using re.findall with a pattern containing a single "and", as this generalizes better for the problem:
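from string import punctuation
import re
# x must be the sentence string again, not the list produced by x.split() above
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so \s reaches the regex engine without an invalid-escape warning
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join(f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x))
print(result)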
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
Once again we use strip(punctuation) because "skin." is captured: we don't want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy inserted inside the sentence.
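To make the effect of strip(punctuation) concrete, here is a small sketch. str.strip only removes characters from the two ends of the string, so the hyphen inside a term like RbC-27 survives; the sample words are taken from the sentence in the question.

```python
from string import punctuation

# Trailing period is removed from the captured word.
print("skin.".strip(punctuation))

# Interior punctuation is untouched, because strip() only works on the ends.
print("RbC-27".strip(punctuation))
```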
Here is our pattern:
(.*?)\sand\s(.*?)\s([^\s]+)
(.*?)\s : capture all content before the "and"; the space that follows is matched but is not part of the captured group
\s(.*?)\s : capture the word immediately following the "and"
([^\s]+) : capture anything that is not a space up until the next space (i.e. the second word after the "and"). This ensures we capture punctuation as well.
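For completeness, the re.sub approach from the question can also be made to work. In Python's re module, a group that did not take part in the match is returned as None, which is why x.group(2)[2] fails when the optional second group is absent. The sketch below makes every group mandatory, so the callback never sees None. The pattern is a simplification of my own (it treats any whitespace-delimited token as a term), and the names AND_PATTERN and share_trailing_word_re are hypothetical.

```python
import re
from string import punctuation

# Assumed simplification: <term> and <term> <shared word>, all mandatory.
AND_PATTERN = re.compile(r"(\S+)\s+and\s+(\S+)\s+(\S+)")


def share_trailing_word_re(sentence):
    def repl(match):
        left, right, shared = match.groups()
        # Drop punctuation from the inserted copy only; keep the original word.
        return f"{left} {shared.strip(punctuation)} and {right} {shared}"

    # sub() rewrites only the matched spans and keeps the rest of the text.
    return AND_PATTERN.sub(repl, sentence)


sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")
print(share_trailing_word_re(sentence))
```

Unlike joining the findall results, sub() leaves any text after the last match in place, so a sentence that continues past the final "and" clause is not truncated.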