import re

input_text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"  # example 1
input_text = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"  # example 2

# My attempt (it does not produce the desired output)
input_text = re.sub(
    r"\(\(PERS\)" + r"((?:\w\s*)+(?:\sy\s(?:\w\s*)+)+)(?=\s*y\s*(?:\)|\())",
    # lambda m: (f"((PERS)){m[1]}) y"),
    lambda m: (f"((PERS)){m[1].replace(' y', ') y ((PERS)')}"),
    input_text,
    flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count

print(input_text)  # --> output
I need to split the content of a ((PERS) ) tag if there is a " y " or a " y)" inside it. That is, the " y" or " y " should be moved out of the ((PERS) ) tag, and the rest of the content, if there is any (as in example 2), should be left in another ((PERS) ) tag. I tried with \s+y\s+? and with \s+y\s+

To achieve the desired output, I tried a regex that matches all the names inside the ((PERS) ) tag that are separated by " y " or " y)". For that I tried to use a positive lookahead to check for " y " or " y)" after each name, and then group all the names together. But this lookahead doesn't work well.

This is the output I want for each of the examples, respectively:
"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #for example 1
"ashsahghgsa ((PERS) María) y ((PERS)Rosa ds) son alumnas de esa escuela y juegan juntas" #for example 2
This regex is for content that has to start with a capital letter: r"([A-Z][\wí]+\s*)", although I think that in this case it would be better to simply use r"((?:\w\s*)+)", since the content is already encapsulated.
You could just use two regexes, which simplifies things a lot. First:
input_text = re.sub(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
    lambda m: f"((PERS) {m[1]}) y ((PERS) {m[2]})",
    input_text,
    flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count
This one covers your 1st use case. It matches the opening ((PERS), whitespace (\s+), and the name ([\w\s]+), which as I understand it contains no other characters such as -. It then matches the trailing y) and a second tag that has no y: \(\(PERS\)\s+([\w\s]+)\)

We then format both matched groups as ((PERS) {m[1]}) y ((PERS) {m[2]}).

The 2nd part of the solution is very similar, except that it matches the 2nd group inside the 1st pair of parentheses:
input_text = re.sub(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
    lambda m: f"((PERS) {m[1]}) y ((PERS) {m[2]})",
    input_text,
    flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count
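To check both substitutions against the two examples from the question, they can be wrapped in one function. This is only a sketch: the names TRAILING_Y, INNER_Y, _rewrap and split_pers are made up for illustration, and the patterns are exactly the two shown above.

```python
import re

# Pattern 1: "((PERS) A y) ((PERS) B)" -> the "y" sits at the end of the first tag
TRAILING_Y = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)", re.IGNORECASE)
# Pattern 2: "((PERS) A y B)" -> both names share a single tag
INNER_Y = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)", re.IGNORECASE)


def _rewrap(m):
    # Put each captured name in its own tag, with "y" between the tags
    return f"((PERS) {m[1]}) y ((PERS) {m[2]})"


def split_pers(text):
    # Run the trailing-"y)" case first, then the "A y B" case
    text = TRAILING_Y.sub(_rewrap, text)
    return INNER_Y.sub(_rewrap, text)


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(split_pers(example_1))
print(split_pers(example_2))
```

Note that for example 2 this produces ((PERS) Rosa ds) with a space after the tag name, since the replacement always inserts one.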
You could of course do it with a single, much more convoluted regex and replacement lambda, but I see no point. This regex would work, for instance:
\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?
but then the replacement would need to handle the case where groups 1 and 5 matched, and otherwise use groups 1 and 3.
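For completeness, here is roughly what the replacement logic for that single regex could look like. It is a sketch under my own assumptions: COMBINED, _rewrap_combined and split_pers_single are hypothetical names, and the fallback branches for inputs other than the two examples are my guess at reasonable behavior.

```python
import re

COMBINED = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?",
    re.IGNORECASE)


def _rewrap_combined(m):
    if m[3] is not None:
        # "((PERS) A y B)": both names were inside one tag (groups 1 and 3).
        # If optional group 4 also consumed a following tag, put it back unchanged.
        return f"((PERS) {m[1]}) y ((PERS) {m[3]})" + (m[4] or "")
    if m[5] is not None:
        # "((PERS) A y) ((PERS) B)": names are in groups 1 and 5
        return f"((PERS) {m[1]}) y ((PERS) {m[5]})"
    # Assumption: a trailing "y)" with no following tag just moves the "y" outside
    return f"((PERS) {m[1]}) y"


def split_pers_single(text):
    return COMBINED.sub(_rewrap_combined, text)


print(split_pers_single("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"))
print(split_pers_single("ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"))
```

The extra branching is the reason the two-regex version is easier to read and maintain.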