javascript, regex, non-greedy

Why does a simple .*? non-greedy regex greedily include additional characters before a match?


I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.
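
A minimal snippet to reproduce this in Node or a browser console, using the pattern and test string above (the variable names are only for illustration):

```javascript
// Test string and pattern from the question; variable names are illustrative.
const text = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";

const lazy = text.match(/HOHO.*?_HO_/);   // with the ? modifier
const greedy = text.match(/HOHO.*_HO_/);  // without it

console.log(lazy[0]);   // HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_
console.log(greedy[0]); // HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_

// Both start at the first HOHO (index 6) and end at the only _HO_.
console.log(lazy.index, lazy[0] === greedy[0]); // 6 true
```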

Edit - a better test string that shows why [^HOHO] doesn't work (it is a character class that excludes the single characters H and O, not the string HOHO): fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye
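
To illustrate the problem with the character-class approach, here is a small check against both test strings (a sketch; the variable names are made up for this example):

```javascript
const first = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
const better = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

// [^HOHO] means "any one character except H or O", exactly like [^HO].
// It does not mean "anything except the string HOHO".
const charClass = /HOHO[^HOHO]*?_HO_/;

// Works on the first string only because the gap happens to contain no H or O.
console.log(charClass.exec(first)[0]); // HOHO___HO_

// Fails on the better string: the gap before _HO_ contains H and O characters.
console.log(charClass.exec(better));   // null
```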


All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.


Solution

  • I figured out a solution with some help from "Regex lazy vs greedy confusion".

    In regex engines like the one used by JavaScript (backtracking NFA engines, I believe), non-greedy only affects where the match ends, not where it starts. The match always begins at the first (leftmost) position where the left-hand side fits and a full match is possible, and non-greedy then stops it at the nearest right-hand match.

    Where there are many left-hand matches for one right-hand match, it will always start from the first one it reaches (which actually gives the longest match).

    Essentially, it goes through the string one character at a time asking "Are there matches starting from this character? If so, take the shortest and finish. If not, move to the next character and repeat". I expected it to ask "Are there matches anywhere in this string? If so, take the shortest of all of them".
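
The two strategies can be sketched in JavaScript. Both functions below are hypothetical helpers written for this illustration, not how the engine is actually implemented; they rely on the sticky (y) flag, which makes a regex match only at exactly lastIndex.

```javascript
// What the engine effectively does: return the match from the first
// start position that yields one.
function leftmostMatch(source, text) {
  const sticky = new RegExp(source, "y");
  for (let start = 0; start <= text.length; start++) {
    sticky.lastIndex = start;          // try to match at this position only
    const m = sticky.exec(text);
    if (m) return m[0];                // first start position wins
  }
  return null;
}

// What I expected: try every start position and keep the shortest match.
function shortestMatch(source, text) {
  const sticky = new RegExp(source, "y");
  let best = null;
  for (let start = 0; start <= text.length; start++) {
    sticky.lastIndex = start;
    const m = sticky.exec(text);
    if (m && (best === null || m[0].length < best.length)) best = m[0];
  }
  return best;
}

const text = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
console.log(leftmostMatch("HOHO.*?_HO_", text)); // HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_
console.log(shortestMatch("HOHO.*?_HO_", text)); // HOHO___HO_
```

The first function gives the same result as a plain text.match(/HOHO.*?_HO_/); the second is the behaviour I was expecting from the ? modifier.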


    You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". Negating a whole string rather than a single character requires a negative lookahead inside a non-capturing group, but it's as simple as dropping the string into (?:(?!).) right after the (?!. For example, (?:(?!HOHO).) matches any one character that is not the start of HOHO.

    So the equivalent of HOHO.*?_HO_ that is non-greedy on both the left and the right would be:

    HOHO(?:(?!HOHO).)*?_HO_
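
A quick check of this pattern against both test strings from the question (a sketch; the variable names are only for illustration):

```javascript
const first = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
const better = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

// (?:(?!HOHO).) consumes one character at a time, but refuses to step
// onto a position where another HOHO begins.
const tempered = /HOHO(?:(?!HOHO).)*?_HO_/;

console.log(first.match(tempered)[0]);  // HOHO___HO_
console.log(better.match(tempered)[0]); // HOHO_H_O_H_O_HO_
```

Unlike the [^HOHO] attempt, single H and O characters in the gap are fine here; only a complete HOHO stops the match.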

    So the regex engine is essentially going through each character like this:

    • HOHO - Does this match the left side?
    • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
    • _HO_ - If so, grab everything until the right-hand match
    • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
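
The last step only matters when there is more than one right-hand match after the same left-hand match. A small sketch with a made-up string (not from the question) that has two _HO_ endings:

```javascript
// Made-up sample: one HOHO followed by two possible _HO_ endings.
const sample = "xx_HOHO_a_HO_b_HO_yy";

const lazy = /HOHO(?:(?!HOHO).)*?_HO_/;   // with ?: stop at the nearest _HO_
const greedy = /HOHO(?:(?!HOHO).)*_HO_/;  // without ?: run to the farthest _HO_

console.log(sample.match(lazy)[0]);   // HOHO_a_HO_
console.log(sample.match(greedy)[0]); // HOHO_a_HO_b_HO_
```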