php regex regex-group

Regex capture multi-line groups


I'm struggling to create a regex that captures what is included between two keywords in a multi-line file.

In particular, consider the following file:

#%META
# date: 2022-08-27
# generated-by: Me
# id: 1
#%ENDS

#%BODY
....
#%ENDS

#%META
# date: 2022-08-27
# generated-by: Another Me
# id: 2
#%ENDS

#%BODY
....
#%ENDS

I want to extract what is included between the #%META and #%ENDS keywords, if possible without the leading #. That is, the desired result is to capture both:

date: 2022-08-27
generated-by: Me
id: 1

and

date: 2022-08-27
generated-by: Another Me
id: 2

I came up with the following regex: (?<=#%META\n)([\S\s]*?)(?=#%ENDS\n).

However, it neither identifies the two chunks of text to be matched nor removes the leading #.

Could anyone help with that?

Thanks a lot! :)


Solution

  • You might first use a pattern to capture everything between #%META and #%ENDS, and then post-process the capture group 1 values, removing the leading # followed by optional horizontal whitespace.

    ^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$
    

    Explanation

    • ^ Start of a line (the pattern is used with the m flag, as in the example below)
    • #%META Match literally
    • ( Capture group 1
      • (?> Atomic group
        • \R Match any Unicode newline sequence
        • (?!#%(?:META|ENDS)$) Negative lookahead, assert that the line is not #%META or #%ENDS
        • .* Match the whole line
      • )+ Close the atomic group and repeat 1+ times
    • ) Close group 1
    • \R Match any Unicode newline sequence
    • #%ENDS Match literally
    • $ End of a line (with the m flag)
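
To see the pieces above working together, here is a minimal sketch that runs the pattern on a shortened input. The sample string and the variable names $input and $count are assumptions made for illustration; only the pattern itself comes from the answer. It suggests that the #%BODY block is skipped and that group 1 still starts with the newline that follows #%META.

```php
<?php
// Same pattern as in the answer; the m flag makes ^ and $ match at line boundaries.
$re = '/^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$/m';

// Shortened, made-up input with two META blocks and one BODY block.
$input = "#%META\n# id: 1\n#%ENDS\n\n#%BODY\n....\n#%ENDS\n\n#%META\n# id: 2\n#%ENDS";

// $matches[0] holds the full matches, $matches[1] the text between the markers.
$count = preg_match_all($re, $input, $matches);

echo $count, "\n";
var_export($matches[1]);
```

Because each captured value begins with a newline, the example in the answer calls trim() before stripping the leading # characters.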


    Example

    $re = '/^#%META((?>\R(?!#%(?:META|ENDS)$).*)+)\R#%ENDS$/m';
    $str = '#%META
    # date: 2022-08-27
    # generated-by: Me
    # id: 1
    #%ENDS
    
    #%BODY
    ....
    #%ENDS
    
    #%META
    # date: 2022-08-27
    # generated-by: Another Me
    # id: 2
    #%ENDS
    
    #%BODY
    ....
    #%ENDS';
    
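    if (preg_match_all($re, $str, $matches)) {
        // $matches[1] holds the text captured between #%META and #%ENDS
        $result = array_map(function ($s) {
            // Drop the surrounding whitespace, then strip "#" and any
            // horizontal whitespace from the start of every line
            return preg_replace("/^#\h*/m", "", trim($s));
        }, $matches[1]);
        var_export($result);
    }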
    

    Output

    array (
      0 => 'date: 2022-08-27
    generated-by: Me
    id: 1',
      1 => 'date: 2022-08-27
    generated-by: Another Me
    id: 2',
    )
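
If the cleaned lines are eventually needed as key/value pairs rather than as one string per block, a further step could look like the sketch below. The helper name parseMetaBlock is hypothetical and not part of the answer, and the sketch assumes that every line has the form "key: value" and that keys do not repeat within a block.

```php
<?php
// Hypothetical helper (not part of the original answer): turn one cleaned
// block such as "date: 2022-08-27\nid: 1" into an associative array.
function parseMetaBlock(string $block): array
{
    $pairs = [];
    // Split on any newline sequence, then on the first ":" of each line.
    foreach (preg_split('/\R/', $block) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $pairs[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $pairs;
}

// Assumed input: one element of $result from the example above.
$block = "date: 2022-08-27\ngenerated-by: Me\nid: 1";
var_export(parseMetaBlock($block));
```

Lines without a colon are silently skipped here; whether that is the right behaviour depends on how strict the file format is meant to be.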