
How to split and reorder the content inside the ((PERS)) tag by ' y ' or ' y)' using Python regular expressions?


import re

input_text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #example 1
input_text = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas" #example 2

input_text = re.sub(
                    r"\(\(PERS\)" + r"((?:\w\s*)+(?:\sy\s(?:\w\s*)+)+)(?=\s*y\s*(?:\)|\())",
                    #lambda m: (f"((PERS)){m[1]}) y"),
                    lambda m: (f"((PERS)){m[1].replace(' y', ') y ((PERS)')}"),
                    input_text, flags=re.IGNORECASE)  # flags must be a keyword argument; the 4th positional argument is count

print(input_text) # --> output

I need to split the content inside a ((PERS) ) tag when it contains a " y " or ends with " y)". The " y" (or " y ") should be moved out of the ((PERS) ) tag, and any content that follows it (as in example 2) should be placed in a new ((PERS) ) tag. I tried with \s+y\s+? and with \s+y\s+

To achieve the desired output, I tried a regex that matches all the names inside the ((PERS) ) tag that are separated by " y " or " y)". I used a positive lookahead to check for " y " or " y)" after each name and then grouped all the names together, but this lookahead doesn't work well.

The expected output for each example is, respectively:

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #for example 1

"ashsahghgsa ((PERS) María) y ((PERS)Rosa ds) son alumnas de esa escuela y juegan juntas" #for example 2

This regex is for content that has to start with a capital letter: r"([A-Z][\wí]+\s*)". However, I think that in this case it would be better to simply use r"((?:\w\s*)+)", since the content is already enclosed in the tag.


Solution

  • You could just use two regexes, which simplifies it a lot. First:

    input_text = re.sub(
      r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
      lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
      input_text,
      flags=re.IGNORECASE)  # pass flags by keyword; positionally it would be read as count
    

    This one covers your 1st use case and matches:

    • ((PERS)
    • followed by some whitespace \s+
    • a run of word characters and whitespace that gets captured, ([\w\s]+); as I understand it, the names contain no other characters such as -
    • some more whitespace, up to y)
    • then the same again, except without the y: \(\(PERS\)\s+([\w\s]+)\)

    Then we format both matched groups as ((PERS) {m[1]}) y ((PERS) {m[2]}).
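To make the breakdown above easier to check, here is a minimal sketch of the same pattern written with re.VERBOSE so that each piece carries its own comment. The constant name PERS_Y_BEFORE_NEXT_TAG is made up for this illustration; the pattern itself is unchanged.

```python
import re

# Same pattern as the first regex, spelled out piece by piece.
# PERS_Y_BEFORE_NEXT_TAG is a hypothetical name used only for this sketch.
PERS_Y_BEFORE_NEXT_TAG = re.compile(
    r"""
    \(\(PERS\)      # literal opening of the tag: ((PERS)
    \s+             # whitespace after the tag name
    ([\w\s]+)       # group 1: first name (word characters and whitespace only)
    \s+y\)          # whitespace, then the stray y right before the closing parenthesis
    \s+             # whitespace between the two tags
    \(\(PERS\)\s+   # opening of the second tag
    ([\w\s]+)       # group 2: second name
    \)              # closing parenthesis of the second tag
    """,
    re.VERBOSE | re.IGNORECASE,
)

text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
match = PERS_Y_BEFORE_NEXT_TAG.search(text)
print(match.group(1))  # Marcos Sy
print(match.group(2))  # Lucy
```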

    The second part of the solution is very similar, except that it matches the second group inside the first pair of parentheses:

    input_text = re.sub(
      r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
      lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
      input_text,
      flags=re.IGNORECASE)  # pass flags by keyword; positionally it would be read as count
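
Putting the two substitutions together, a minimal end-to-end sketch could look like the following. The names TRAILING_Y, INNER_Y, retag and split_pers_tags are invented for this example; the patterns are the two shown above, applied in the same order.

```python
import re

# Hypothetical names; the patterns are the two regexes from this answer.
TRAILING_Y = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)", re.IGNORECASE)
INNER_Y = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)", re.IGNORECASE)


def retag(m):
    """Rebuild the match as two PERS tags joined by an untagged 'y'."""
    return f"((PERS) {m[1]}) y ((PERS) {m[2]})"


def split_pers_tags(text):
    """Apply both substitutions in order and return the rewritten text."""
    text = TRAILING_Y.sub(retag, text)   # case 1: "... y) ((PERS) ...)"
    return INNER_Y.sub(retag, text)      # case 2: "... y ..." inside one tag


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(split_pers_tags(example_1))
# ((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
print(split_pers_tags(example_2))
# ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas
```

Note that group 1 is greedy, so a tag with several " y " separators is split only at the last one in a single pass.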
    

    You could of course do it with a single, much more convoluted regex and replacement lambda, but I see no point. This regex would work, for instance: \(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))? but then you'd need to handle the case where groups 1 and 5 are matched, and otherwise build the replacement from groups 1 and 3.
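
For completeness, here is a rough sketch of what that single-regex variant might look like with a replacement callback that branches on which groups matched. The names COMBINED and retag_combined are hypothetical, and the third branch (a trailing y with no following PERS tag) is an assumption about the desired behavior that the question does not cover.

```python
import re

# The combined pattern quoted above; COMBINED is a made-up name.
COMBINED = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?",
    re.IGNORECASE,
)


def retag_combined(m):
    first, inner_second, next_tag, next_name = m[1], m[3], m[4], m[5]
    if inner_second is not None:
        # "y" sat between two names inside one tag (groups 1 and 3).
        # A following tag, if any, was consumed by group 4, so re-emit it as is.
        return f"((PERS) {first}) y ((PERS) {inner_second}){next_tag or ''}"
    if next_name is not None:
        # "y" trailed the name and another PERS tag follows (groups 1 and 5).
        return f"((PERS) {first}) y ((PERS) {next_name})"
    # Assumption: a trailing "y" with no following PERS tag is simply moved out.
    return f"((PERS) {first}) y"


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(COMBINED.sub(retag_combined, example_1))
# ((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
print(COMBINED.sub(retag_combined, example_2))
# ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas
```

The extra branching is the cost mentioned above, which is why the two-regex version is easier to read and maintain.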