I have a string
1234X5678
and I use this regex to match pattern
.X|..X|X.
I got
34X
The question is why didn't I get 4X
or X5
?
Why regex choose to perform the second pattern?
The main point here is:
Regex engine analyzes the input from LEFT TO RIGHT by default.
So, you have an alternation pattern .X|..X|X.
and you run it against 1234X5678
. See what happens:
Each alternative branch is tested against each location in the string from left to right.
The first 1-7 steps show how the engine tries to match the characters at the beginning of the string. However, none of the branches (neither .X
, nor ..X
, nor X.
match 12
or 123
).
Steps 8-13 just repeat the same failing scenario as none of the branches match 23
or 234
.
Steps 14-19 show a success scenario because the 34X
can be matched with Branch 2 (..X
).
The regex engine does not reach the location before 4
since this location gets matched and consumed.
And another conclusion:
The order of alternations matters, and in NFA regex engines the first alternative matched wins, BUT this alternative does not have to be the first shortest one, a farther longer alternative that matches the same characters in the beginning can match earlier.