Search code examples
regexalternation

Why regex engine choose to match pattern `..X` from `.X|..X|X.`?


I have a string

1234X5678

and I use this regex to match pattern

.X|..X|X.

I got

34X

The question is why didn't I get 4X or X5?

Why regex choose to perform the second pattern?


Solution

  • The main point here is:

    Regex engine analyzes the input from LEFT TO RIGHT by default.

    So, you have an alternation pattern .X|..X|X. and you run it against 1234X5678. See what happens:

    enter image description here

    Each alternative branch is tested against each location in the string from left to right.

    The first 1-7 steps show how the engine tries to match the characters at the beginning of the string. However, none of the branches (neither .X, nor ..X, nor X. match 12 or 123).

    Steps 8-13 just repeat the same failing scenario as none of the branches match 23 or 234.

    Steps 14-19 show a success scenario because the 34X can be matched with Branch 2 (..X).

    The regex engine does not reach the location before 4 since this location gets matched and consumed.

    And another conclusion:

    The order of alternations matters, and in NFA regex engines the first alternative matched wins, BUT this alternative does not have to be the first shortest one, a farther longer alternative that matches the same characters in the beginning can match earlier.