java, regex, multiline, records

Java regex to match multiline records starting with fixed label


Following is an example of a list of multiline records, each starting with a fixed string label (LABEL):

<Irrelevant line>
...
<Irrelevant line>
LABEL ...
...
...
LABEL ...
...
...
LABEL ...
...
...
LABEL ...
...
...

Is there a Java regular expression that can match the above and extract each record, i.e.

LABEL ...
...
...

Also, is this the fastest way of extracting those records, or would reading line by line and checking the start of each line be faster?


Solution

  • To iterate over all the LABEL groups, use this:

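    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // subjectString holds the whole input text
    Pattern regex = Pattern.compile("(?sm)LABEL.*?(?=^LABEL|\\Z)");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        // the current LABEL group: regexMatcher.group()
    }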


    Explanation

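    • (?s) activates DOTALL mode, allowing the dot to match line terminators and therefore to span several lines
    • (?m) turns on multi-line mode, allowing ^ and $ to match at the start and end of each line
    • LABEL matches the literal characters
    • .*? lazily matches any characters up to...
    • the point where the lookahead (?=^LABEL|\\Z) can assert that what follows is either the next LABEL at the start of a line or the end of the string (\\Z is the Java string literal for \Z, which matches at the end of the input, before a final line terminator if there is one)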
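
    On the speed question, the line-by-line alternative is easy to write, so both can be timed on real data. Below is a minimal sketch that puts the two approaches side by side. The class name RecordExtractor, the method names withRegex and lineByLine, and the sample input are made up for illustration, and both methods trim each record so that their results are directly comparable (the regex version otherwise keeps the newline before the next LABEL).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class RecordExtractor {

    // Same pattern as above, compiled once and reused.
    private static final Pattern RECORD =
            Pattern.compile("(?sm)LABEL.*?(?=^LABEL|\\Z)");

    // Regex approach: every successful find() yields one record.
    static List<String> withRegex(String text) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            records.add(m.group().trim());
        }
        return records;
    }

    // Line-by-line approach: a line starting with LABEL opens a new record.
    static List<String> lineByLine(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // stays null until the first LABEL line
        for (String line : text.split("\n")) {
            if (line.startsWith("LABEL")) {
                if (current != null) {
                    records.add(current.toString().trim());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
            }
            // lines before the first LABEL are skipped
        }
        if (current != null) {
            records.add(current.toString().trim());
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample input with one irrelevant leading line.
        String text = "irrelevant\nLABEL a\na1\na2\nLABEL b\nb1\nLABEL c\nc1\n";
        System.out.println(withRegex(text));
        System.out.println(lineByLine(text));
    }
}
```

    Note that the lazy .*? has to retry the lookahead after each character it consumes, whereas the loop does a single startsWith check per line. Which one is actually faster depends on the input and the JVM, so it is worth measuring on representative data. The sketch also assumes the input uses \n as its line separator.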