php, regex, lazy-evaluation, pcre

regex lazy capturing fails after lazy "anything" ... unless I know what follows


In my PCRE engine (and on regex101) I can use this regex:

abc [\s\S]*?(mno)?

run against this string

abc random mno unknowable

it does not capture the mno ... unless I add something to the end of the regex, like so:

abc [\s\S]*?(mno)? un

Please note that I have simplified the regex to show the problem ... but this comes from a "real life" regex interpreting extracted text from a business document. The key point is that there must be some text (the abc) otherwise the regex must fail ... then there can be some more text and an optional mno ... but it is impossible to guess the "random" or the "unknowable" in-between and following.

The solution has eluded me over several days of trying. I'm hoping smarter people than me will know the answer.
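
For reference, the behaviour described above can be reproduced with a short script. This is only a sketch: it uses Python's re module as a stand-in for the PCRE engine (an assumption; the lazy quantifier and optional group involved behave the same way in both), and the variable names are made up for illustration.

```python
import re

text = "abc random mno unknowable"

# Lazy "anything" followed only by an optional group: the match is allowed
# to stop right after "abc ", so the group never takes part in it.
short = re.search(r"abc [\s\S]*?(mno)?", text)
print(repr(short.group(0)), short.group(1))        # 'abc ' None

# Requiring " un" afterwards forces the lazy part to expand until the
# optional group and the trailing literal can both match.
anchored = re.search(r"abc [\s\S]*?(mno)? un", text)
print(repr(anchored.group(0)), anchored.group(1))  # 'abc random mno un' mno
```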


Solution

  • It looks like you're trying to capture an optional text mno from a pattern.

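    If so, you can use a negative lookahead (?!mno) so that the greedy [\s\S] stops before mno instead of accidentally consuming the text you are looking for.

    In fact, you should not use a non-greedy quantifier when the text you want to capture is optional: the lazy [\s\S]*? matches as little as possible and the optional (mno)? is allowed to match nothing, so the regex succeeds right after abc and the capture group stays empty.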

    Here's a working solution:

    abc (?:(?!mno)[\s\S])*(mno)?
    

    See the proof.


    Edit

    If mno is a complicated pattern, you can still find the optional mno. But this time, instead of mno, put the fixed part abc in the negative lookahead, so the search for the complicated pattern mno won't cross into another abc section.

    Notice the ? quantifier is on the whole thing that contains mno, not mno itself:

    abc (?:(?:(?!abc )[\s\S])*(mno))?(?:(?!abc )[\s\S])*
    

    In this case it can still capture the complicated mno.

    See the proof
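
Both patterns from this answer can be tried out with a small script. This is a sketch rather than a definitive test: it uses Python's re module in place of PCRE (an assumption; the lookahead and quantifier syntax used here is the same in both), and the names SIMPLE, SECTIONED and captures are hypothetical helpers, not part of the original answer.

```python
import re

# First pattern: consume anything that does not start "mno", then
# optionally capture mno.
SIMPLE = re.compile(r"abc (?:(?!mno)[\s\S])*(mno)?")

# Edit variant: the optional group wraps the whole search for mno, and
# the lookahead on "abc " keeps each match inside a single abc section.
SECTIONED = re.compile(r"abc (?:(?:(?!abc )[\s\S])*(mno))?(?:(?!abc )[\s\S])*")


def captures(pattern, text):
    """Return group 1 of every match (None where the optional mno is absent)."""
    return [m.group(1) for m in pattern.finditer(text)]


print(captures(SIMPLE, "abc random mno unknowable"))  # ['mno']
print(captures(SIMPLE, "abc random unknowable"))      # [None]
print(captures(SIMPLE, "xyz random mno"))             # []
print(captures(SECTIONED, "abc foo abc bar mno baz")) # [None, 'mno']
```

An empty list means the required abc was missing and the regex failed; a None entry means abc was found but that section contains no mno.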