I'm struggling to create a regex that captures what is included between two keywords in a multi-line file.
In particular, consider the following file:
#%META
# date: 2022-08-27
# generated-by: Me
# id: 1
#%ENDS
#%BODY
....
#%ENDS
#%META
# date: 2022-08-27
# generated-by: Another Me
# id: 2
#%ENDS
#%BODY
....
#%ENDS
I want to parse what is included between the #%META and #%ENDS keywords, if possible without the leading #, i.e., the desired result is to capture both:
date: 2022-08-27
generated-by: Me
id: 1
and
date: 2022-08-27
generated-by: Another Me
id: 2
I came up with the following regex: (?<=#%META\n)([\S\s]*?)(?=#%ENDS\n)
However, it neither identifies the two chunks of text to be matched nor removes the leading #.
Could anyone help with that?
Thanks a lot! :)
You might use a pattern to first capture all the parts between #%META and #%ENDS, and then post-process the capture group 1 values by removing the leading # followed by optional spaces.
^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$
Explanation
^  Start of the line (the m flag is used)
#%META  Match literally
(  Capture group 1
(?>  Atomic group
\R  Match any Unicode newline sequence
(?!#%(?:META|ENDS)$)  Negative lookahead, assert that the line is not #%META or #%ENDS
.*  Match the whole line
)+  Close the atomic group and repeat 1+ times
)  Close group 1
\R  Match any Unicode newline sequence
#%ENDS  Match literally
$  End of the line (the m flag is used)
Example
$re = '/^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$/m';
$str = '#%META
# date: 2022-08-27
# generated-by: Me
# id: 1
#%ENDS
#%BODY
....
#%ENDS
#%META
# date: 2022-08-27
# generated-by: Another Me
# id: 2
#%ENDS
#%BODY
....
#%ENDS';
if (preg_match_all($re, $str, $matches)) {
    // Group 1 holds each block body; strip the leading "#" and spaces from every line.
    $result = array_map(function ($s) {
        return preg_replace("/^#\h*/m", "", trim($s));
    }, $matches[1]);
    var_export($result);
}
Output
array (
  0 => 'date: 2022-08-27
generated-by: Me
id: 1',
  1 => 'date: 2022-08-27
generated-by: Another Me
id: 2',
)
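To see what the negative lookahead contributes, here is a minimal sketch that wraps the two steps above in a helper and runs it on a malformed input. The function name extractMetaBlocks and the sample string are made up for illustration and are not part of the original question; the pattern and the cleanup step are the ones shown above.

```php
<?php
// Hypothetical helper (name is an assumption): returns the cleaned text of
// every complete #%META ... #%ENDS block found in $text.
function extractMetaBlocks(string $text): array
{
    $re = '/^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$/m';
    if (!preg_match_all($re, $text, $matches)) {
        return []; // no complete block found (or a regex error occurred)
    }
    // Strip the leading "#" and any horizontal whitespace from each line.
    return array_map(function ($s) {
        return preg_replace('/^#\h*/m', '', trim($s));
    }, $matches[1]);
}

// Made-up input: the first #%META block is missing its #%ENDS line.
$input = "#%META\n# id: 1\n#%META\n# id: 2\n#%ENDS";

var_export(extractMetaBlocks($input));
```

Because the lookahead refuses to consume a line that is exactly #%META or #%ENDS, the first block cannot run on into the second one, so only the complete second block should be returned here. This assumes \n line endings, as in the example above.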