How can I write a regexp that matches all multi-line sections (with a varying number of lines) that start with a given identifier, stopping when an end-of-message keyword is reached?
Example: I want to extract all sections that start with the keyword 'START', up until 'END_OF_MSG', from a given text block:
HELLO
START ABC DEF GHI JKL
QWER RANDOM TEXT 213%@#!
UIOP RANDOMZXCVB123456
START ABC DEF GHI JKL
ZZZZZ RANDOMTEXT213%@#!
11111 RANDOMZXCVB123456
$$$$$$ SOMEMORETEXT
START ABC DEF GHI JKL
QWER RANDOMTEXT213%@#!
$$$$$ RANDOMZXCVB123456
END_OF_MSG
I'd like the regexp to produce three sections:
START ABC DEF GHI JKL
QWER RANDOM TEXT 213%@#!
UIOP RANDOMZXCVB123456
START ABC DEF GHI JKL
ZZZZZ RANDOMTEXT213%@#!
11111 RANDOMZXCVB123456
$$$$$$ SOMEMORETEXT
START ABC DEF GHI JKL
QWER RANDOMTEXT213%@#!
$$$$$ RANDOMZXCVB123456
So far I've worked out a regexp which seems to do this almost correctly:
(?m)^START(.|\n)*?((?=^START)|END_OF_MSG)
The issue is that the last section also includes the END_OF_MSG identifier, which I'd like to skip. I also suspect that this regexp is not the most efficient way of grabbing those sections. Any ideas on how to improve it?
Example available here: Regex101
You can match START followed by the rest of the line, and then match all following lines that do not start with START or END_OF_MSG, using a negative lookahead.
^START\b.*(?:\R(?!START\b|END_OF_MSG\b).*)*
Explanation
^ - Start of the line (this relies on the multiline flag; without it, ^ only matches at the start of the string)
START\b.* - Match START, a word boundary and the rest of the line
(?: - Non-capturing group
\R - Match a newline sequence
(?!START\b|END_OF_MSG\b).* - Match the whole line if it does not start with either of the alternatives, using a negative lookahead
)* - Close the group and repeat it 0+ times to match all the lines
In Java, with doubled backslashes:
^START\\b.*(?:\\R(?!START\\b|END_OF_MSG\\b).*)*
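To show how the pattern might be used from Java, here is a minimal sketch. The class name SectionExtractor and the method extractSections are made up for illustration and are not part of the question. The sketch assumes Pattern.MULTILINE is passed so that ^ matches at the start of every line, equivalent to the (?m) prefix used in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionExtractor {
    // MULTILINE makes ^ match at the start of each line, not only at the
    // start of the whole input (same effect as the inline (?m) flag).
    private static final Pattern SECTION = Pattern.compile(
            "^START\\b.*(?:\\R(?!START\\b|END_OF_MSG\\b).*)*",
            Pattern.MULTILINE);

    // Hypothetical helper: returns every section that begins with START,
    // ending before the next START line or the END_OF_MSG line.
    public static List<String> extractSections(String text) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(text);
        while (matcher.find()) {
            sections.add(matcher.group());
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "HELLO",
                "START ABC DEF GHI JKL",
                "QWER RANDOM TEXT 213%@#!",
                "UIOP RANDOMZXCVB123456",
                "START ABC DEF GHI JKL",
                "ZZZZZ RANDOMTEXT213%@#!",
                "11111 RANDOMZXCVB123456",
                "$$$$$$ SOMEMORETEXT",
                "START ABC DEF GHI JKL",
                "QWER RANDOMTEXT213%@#!",
                "$$$$$ RANDOMZXCVB123456",
                "END_OF_MSG");

        // Print each section followed by a separator line.
        for (String section : extractSections(text)) {
            System.out.println(section);
            System.out.println("-----");
        }
    }
}
```

Because the repeated group only consumes lines that pass the negative lookahead, the END_OF_MSG line is never part of a match, so no trimming is needed afterwards.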