The key constraint for this problem is that the solution must use re.findall without lookarounds and without multiline mode. This is partly because I also want to port it to regex libraries that don't support lookarounds.
Say I have the string below:
>>> a = '''bleh blee BLOO
GOO ruu bum LUM Tum
sss ddf GHH rty

[[[BREAK]]]

gumpty RUMPTY BOBBY
JOE low blow

[[[BREAK]]]

BEEP boop bob
yellow green tam nim
reese yob

[[[BREAK]]]

'''
What I want is to use re.findall to capture everything that is not "\n\n\[\[\[BREAK\]\]\]\n\n", without lookarounds and without multiline mode. And yes, I want the double \n's on each side to be part of the excluded string.
The desired OUTPUT is as follows:
>>> b[0]
'bleh blee BLOO\nGOO ruu bum LUM Tum\nsss ddf GHH rty'
>>> b[1]
'gumpty RUMPTY BOBBY\nJOE low blow'
>>> b[2]
'BEEP boop bob\nyellow green tam nim\nreese yob'
I'm well aware that I can use split() and re.split(), but I want a clearer understanding of how to write the regex for this properly, because I'm sure it will come up again in the future.
What's grinding my gears is that even with lookarounds I can't do this without cheating. Below I'm telling it to capture every run of characters that doesn't contain a "\[" before my ignored string, but that doesn't account for the possibility that a "\[" may appear in the text I want to keep:
>>> b = re.findall(r'[^\[]+(?=\n\n\[\[\[BREAK\]\]\]\n\n)', a)
OUTPUT
>>> b[0]
'bleh blee BLOO\nGOO ruu bum LUM Tum\nsss ddf GHH rty'
>>> b[1]
'gumpty RUMPTY BOBBY\nJOE low blow'
>>> b[2]
'BEEP boop bob\nyellow green tam nim\nreese yob'
Can anyone provide insight? An improvement on my lookaround attempt would also be welcome, to give me a better understanding of that as well.
Ok, I think you can do it this way.
(?:^(?:\n\n\[\[\[BREAK\]\]\]\n\n)+)?([\S\s]*?)(?:(?:\n\n\[\[\[BREAK\]\]\]\n\n)+|$)
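You have to match the stuff you don't want in order to move the current position past it; that's just the way it is. The text you do want ends up in capture group 1, which is what re.findall returns when the pattern has a single capturing group.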
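As a rough sketch of how this behaves with re.findall, the snippet below rebuilds the sample string and applies the pattern. It assumes there is a blank line on each side of every [[[BREAK]]] marker (which is what the \n\n in the delimiter implies); the names DELIM, pattern and the final filtering step are only illustrative and not part of the original pattern.

```python
import re

# Sample text, ASSUMING a blank line on each side of every [[[BREAK]]] marker.
a = (
    "bleh blee BLOO\nGOO ruu bum LUM Tum\nsss ddf GHH rty"
    "\n\n[[[BREAK]]]\n\n"
    "gumpty RUMPTY BOBBY\nJOE low blow"
    "\n\n[[[BREAK]]]\n\n"
    "BEEP boop bob\nyellow green tam nim\nreese yob"
    "\n\n[[[BREAK]]]\n\n"
)

# The delimiter we want to skip over (illustrative helper name).
DELIM = r"\n\n\[\[\[BREAK\]\]\]\n\n"

# Same pattern as above, assembled from the delimiter:
#   optional leading delimiters, lazy capture, then delimiters or end of string.
pattern = r"(?:^(?:" + DELIM + r")+)?([\S\s]*?)(?:(?:" + DELIM + r")+|$)"

# findall returns only group 1. The `$` alternative can also yield an
# empty match at the very end of the string, so empty strings are dropped.
b = [chunk for chunk in re.findall(pattern, a) if chunk]

for chunk in b:
    print(repr(chunk))
```

No lookarounds and no re.MULTILINE are involved: the delimiter is consumed by the non-capturing groups, so the next search starts after it.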
Expanded:
 (?:
      ^
      (?: \n\n \[\[\[BREAK\]\]\] \n\n )+
 )?
 ( [\S\s]*? )                            # (1)
 (?:
      (?: \n\n \[\[\[BREAK\]\]\] \n\n )+
   |  $
 )
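If you want to keep the expanded layout in Python, it can probably be compiled as-is with the re.VERBOSE flag, which ignores literal whitespace and allows # comments inside the pattern. The following is a minimal sketch under that assumption; splitter, sample and chunks are made-up names, and the sample string is a shortened stand-in that deliberately contains a "[" inside the text to keep.

```python
import re

# The expanded pattern, compiled in verbose mode so the layout and the
# comment can stay in the source. \n is still interpreted by the regex engine.
splitter = re.compile(r"""
    (?:
        ^
        (?: \n\n \[\[\[BREAK\]\]\] \n\n )+
    )?
    ( [\S\s]*? )                            # (1) the text to keep
    (?:
        (?: \n\n \[\[\[BREAK\]\]\] \n\n )+
      | $
    )
""", re.VERBOSE)

# Hypothetical short input; note the literal "[x]" inside the second chunk.
sample = "one\ntwo\n\n[[[BREAK]]]\n\nthree [x]\n\n[[[BREAK]]]\n\n"

# finditer exposes group 1 explicitly; empty captures are skipped.
chunks = [m.group(1) for m in splitter.finditer(sample) if m.group(1)]
print(chunks)
```

Because the capture is a lazy "any character" run rather than a negated class like [^\[]+, a stray "[" in the wanted text does not cut the chunk short.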