RegEx non-greedy quantifier .*? not working as expected


I am trying to create a regex to match "section 2 foo 2019 foo" in the following string (a minimal example, not the real thing):

section 1 bar bar section 2 foo 2019 foo section 3 bar 2021 bar end 

(string "section", followed by a number, followed by any text, followed by a 4-digit year, followed again by any text)

My initial thought was to use non-greedy quantifiers and one capturing and one non-capturing group, like this:

(section [0-9]{1}.*?(19|20)[0-9]{2}.*?)(?:section)

However, this will produce the following match for the capturing group:

section 1 bar bar section 2 foo 2019 foo

So, it also matches section 1, which I want to exclude.

After some background reading, I understand that the problem here is that "non-greedy" does not actually mean "match the shortest possible string". The engine still starts at the leftmost position where a match can begin, and the lazy quantifier then consumes as few characters as possible from that starting point, reading from left to right.
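
The behaviour can be reproduced with a short Python sketch. The use of Python's `re` module and the variable names `text`, `lazy`, and `m` are assumptions made for illustration; the pattern and the sample string are the ones shown above.

```python
import re

text = "section 1 bar bar section 2 foo 2019 foo section 3 bar 2021 bar end "

# The pattern from the first attempt: lazy quantifiers, one capturing group.
lazy = re.compile(r"(section [0-9]{1}.*?(19|20)[0-9]{2}.*?)(?:section)")

m = lazy.search(text)

# The engine commits to the leftmost possible start ("section 1").
# The lazy .*? then grows one character at a time until a year follows,
# which carries it across the start of "section 2".
print(m.start())          # 0
print(repr(m.group(1)))   # 'section 1 bar bar section 2 foo 2019 foo '
```

Laziness only limits how far each quantifier expands once a start position has been chosen; it does not make the engine prefer a later, shorter match.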

There are a few answers here on SO regarding this problem, but I am still struggling to find the right regex for this particular case. I tried using a non-capturing group with negative lookahead, like this:

section [0-9]{1,2}(?:(?!section [0-9]{1}).).*(?!202[1-9]{1})[0-9]{4} .*?

But this will still match the first unwanted section, unexpectedly. Any idea where my thinking might be wrong?


Solution

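  • The issue here is that .*? can still run across section boundaries: it keeps expanding past the start of another section until it finds a year, whether or not the section it started in contains one. Your final regex, which attempts to use a tempered dot, is on the right track, but it applies the tempered dot to only one character and then falls back to a greedy .*, which crosses sections again. Consider this version, where the tempered dot itself is repeated: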

    \bsection \d+ (?:(?!\bsection \d+).)*?(?:19|20)\d{2}\b
    

    Explanation:

    \bsection \d+             match "section" followed by a number and space
    (?:(?!\bsection \d+).)*?  match any content, without crossing over to another section
    (?:19|20)\d{2}\b          match a 4-digit year starting with 19 or 20, ending at a word boundary
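
As a rough check, the pattern above can be run against the sample string from the question. This is a minimal sketch assuming Python's `re` module. The names `tempered`, `with_tail`, and `tails` are hypothetical, and the `with_tail` variant is an illustrative extension (not part of the pattern above) that appends a second tempered dot so the text after the year is included as well.

```python
import re

text = "section 1 bar bar section 2 foo 2019 foo section 3 bar 2021 bar end "

# Tempered dot: each "." is consumed only if no new section starts at that
# position, so a match can never cross into the next section.
tempered = re.compile(r"\bsection \d+ (?:(?!\bsection \d+).)*?(?:19|20)\d{2}\b")

first = tempered.search(text)
print(first.group(0))  # section 2 foo 2019

# Every section that contains a year, not just the first one.
matches = [m.group(0) for m in tempered.finditer(text)]
print(matches)  # ['section 2 foo 2019', 'section 3 bar 2021']

# Hypothetical extension: a greedy tempered dot after the year also picks up
# the trailing text, stopping just before the next "section N".
with_tail = re.compile(
    r"\bsection \d+ (?:(?!\bsection \d+).)*?(?:19|20)\d{2}\b"
    r"(?:(?!\bsection \d+).)*"
)
tails = [m.group(0).rstrip() for m in with_tail.finditer(text)]
print(tails)  # ['section 2 foo 2019 foo', 'section 3 bar 2021 bar end']
```

Section 1 is skipped because the tempered dot stops at "section 2" before any year is found. Section 3 also matches, since it contains a year as well; taking only the first match with `search`, or filtering on the year, narrows the result to section 2.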