regex pcre

Match every item in between two words in a comma-separated list


I have a comma-separated list of numbers. I want to match every item that comes after START (if present) and before END (if present).

Most of my test cases pass with

(?:.*?START|END.*)(*SKIP)(*F)|\d+

The exceptions are the cases where START appears after END, or where there are multiple instances of START and END.

input                               matches
123,45678,789,777,888,1234          123,45678,789,777,888,1234
123,START,789,777,888,1234          789,777,888,1234
123,45678,789,777,END,1234          123,45678,789,777
123,START,789,777,END,1234          789,777
123,END,789,777,START,1234          123
123,START,789,START,777,END,1234    789,777
123,START,789,END,777,END,1234      789
123,END,789,START,777,END,1234      123

Here's the regex101 project I was trying. I'm using PCRE2 (PHP 7.3).


Solution

  • You might fix your pattern by adding a restriction to find START that has no END before it:

    (?:^(?:(?!END).)*?START|END.*)(*SKIP)(*F)|\d+
    // ^^^^^^^^^^^^^^^
    

    See the regex demo.

    Here, ^(?:(?!END).)*?START (instead of .*?START) matches:

    • ^ - start of string
    • (?:(?!END).)*? - any char, other than line break chars, as few as possible, that does not start an END char sequence
    • START - a START char sequence.
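
For readers who want to check the first pattern outside PCRE, here is a minimal Python sketch. Python's built-in re module has no (*SKIP)(*F), so this swaps in the common capture-group workaround: the spans to discard are matched without a group, and only group 1 is kept. The name match_items is hypothetical, and the code emulates the pattern rather than running it as written.

```python
import re

# Emulation of (?:^(?:(?!END).)*?START|END.*)(*SKIP)(*F)|\d+
# The first two alternatives consume text we want to discard;
# only the third alternative captures (group 1).
PATTERN = re.compile(r"^(?:(?!END).)*?START|END.*|(\d+)")


def match_items(line):
    """Return digit runs after the first START that has no END before it,
    and before the first END (hypothetical helper for illustration)."""
    return [m.group(1) for m in PATTERN.finditer(line) if m.group(1) is not None]


print(match_items("123,START,789,START,777,END,1234"))  # ['789', '777']
print(match_items("123,END,789,777,START,1234"))        # ['123']
```

In PHP the original pattern can be passed to preg_match_all unchanged; the capture-group form is only needed in engines without backtracking control verbs.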

    You can also use

    (?:\G(?!\A)|^(?:(?:(?!END).)*?START)?)(?:(?!END).)*?\K\d+
    

    See the regex demo.

    Details:

    • (?:\G(?!\A)|^(?:(?:(?!END).)*?START)?) - either the end of the previous successful match (\G(?!\A)), or (|) the start of the string (^) followed by an optional run of text up to the first START that is not preceded by END ((?:(?:(?!END).)*?START)?)
    • (?:(?!END).)*? - any char, other than line break chars, zero or more times but as few as possible, that does not start an END char sequence
    • \K - match reset operator that discards all text matched so far from the overall match memory buffer
    • \d+ - one or more digits.
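
Neither \G nor \K is available in Python's built-in re module, so the sketch below does not run the second pattern. It is a plain-string reference for what both patterns are expected to return on the test table in the question, under the assumption that every item is either all digits or one of the two keywords. The name expected_items is hypothetical.

```python
def expected_items(line):
    """Hypothetical reference: items after the first START that has no END
    before it, and before the first END."""
    # Keep only the text before the first END (the whole line if END is absent).
    head, _, _ = line.partition("END")
    # Within that text, look for the first START; it cannot have an END before it.
    _, start, tail = head.partition("START")
    window = tail if start else head
    # Assumption: items are all-digit strings or keywords, so isdigit() is enough.
    return [item for item in window.split(",") if item.isdigit()]


print(expected_items("123,START,789,END,777,END,1234"))  # ['789']
print(expected_items("123,45678,789,777,END,1234"))      # ['123', '45678', '789', '777']
```

Comparing this output with the matches from either regex is a quick way to spot a test case that a pattern gets wrong.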