python, regex, regex-lookarounds, multiline

Python re.findall without lookarounds and without multiline mode (and not split()) to capture everything besides a specified string


The key to this problem is to use re.findall without lookarounds and without multiline mode. This is partly because I also want to port the pattern to regex libraries that don't support lookarounds.

Say I have the string below:

>>> a = '''bleh blee BLOO
GOO ruu bum LUM Tum
sss ddf GHH rty

[[[BREAK]]]

gumpty RUMPTY BOBBY
JOE low blow

[[[BREAK]]]

BEEP boop bob
yellow green tam nim
reese yob

[[[BREAK]]]

'''

What I want to do is use re.findall to capture everything that is not "\n\n\[\[\[BREAK\]\]\]\n\n", without using lookarounds and without multiline mode. And yes, I want the double \n's to be part of the excluded string.

The desired OUTPUT is as follows:

>>> b[0]
'bleh blee BLOO\nGOO ruu bum LUM Tum\nsss ddf GHH rty'
>>> b[1]
'gumpty RUMPTY BOBBY\nJOE low blow'
>>> b[2]
'BEEP boop bob\nyellow green tam nim\nreese yob'

I'm well aware that I can use split() and re.split(), but I want a better understanding of how to write the regex for this properly, because I'm sure it will come up again in the future.

What's grinding my gears is that even with lookarounds I'm having a problem doing this without cheating. Below I'm telling it to capture every run of characters that doesn't contain a "[" before my ignored string, but that doesn't account for the possibility that a "[" may be present in the text I want to keep:

>>> import re
>>> b = re.findall(r'[^\[]+(?=\n\n\[\[\[BREAK\]\]\]\n\n)', a)

OUTPUT

>>> b[0]
'bleh blee BLOO\nGOO ruu bum LUM Tum\nsss ddf GHH rty'
>>> b[1]
'gumpty RUMPTY BOBBY\nJOE low blow'
>>> b[2]
'BEEP boop bob\nyellow green tam nim\nreese yob'

Can anyone provide insight? An improvement on my lookaround version would also be welcome, to give me a better understanding of that as well.


Solution

  • Ok, I think you can do it this way.
    (?:^(?:\n\n\[\[\[BREAK\]\]\]\n\n)+)?([\S\s]*?)(?:(?:\n\n\[\[\[BREAK\]\]\]\n\n)+|$)

    You have to match the stuff you don't want in order to move the current position
    past it. That's just the way it is.

    Expanded

     (?:
          ^
          (?: \n\n \[\[\[BREAK\]\]\] \n\n )+
     )?
     ( [\S\s]*? )                  # (1)
     (?:
          (?: \n\n \[\[\[BREAK\]\]\] \n\n )+
       |  $ 
     )
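    As a rough sketch of how the expanded pattern above could be run from Python, here it is compiled with re.VERBOSE (so the layout and the comment survive) and fed to re.findall. The names SEP, PATTERN and chunk are only illustrative and not part of the original answer, and the test string is the one from the question, written out explicitly. Note that re.VERBOSE is just a convenience for the layout; re.MULTILINE is not used.

```python
import re

# The separator to exclude, including the blank lines around it.
# (SEP and PATTERN are illustrative names, not from the original answer.)
SEP = r'\n\n\[\[\[BREAK\]\]\]\n\n'

# Same structure as the expanded pattern: no lookarounds, no re.MULTILINE.
PATTERN = re.compile(
    r'''
    (?:
        ^
        (?: {sep} )+
    )?
    ( [\S\s]*? )          # (1) the text to keep
    (?:
        (?: {sep} )+
      | $
    )
    '''.format(sep=SEP),
    re.VERBOSE,
)

# The string from the question, one chunk and one separator per line.
a = (
    'bleh blee BLOO\nGOO ruu bum LUM Tum\nsss ddf GHH rty'
    '\n\n[[[BREAK]]]\n\n'
    'gumpty RUMPTY BOBBY\nJOE low blow'
    '\n\n[[[BREAK]]]\n\n'
    'BEEP boop bob\nyellow green tam nim\nreese yob'
    '\n\n[[[BREAK]]]\n\n'
)

# With a single capture group, findall returns group 1 of every match.
# Empty captures are dropped, since the pattern is allowed to match
# the empty string (for example at the very end of the input).
b = [chunk for chunk in PATTERN.findall(a) if chunk]
```

    Filtering out empty strings is a choice made for this sketch: because every part of the pattern is optional apart from the final alternation, findall can report a zero-length match, and the desired output in the question contains only the three non-empty chunks.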