regex pcre

Regex grouping rules for alternation


How does a regular expression engine know that the following:

lions|starz|summit

is grouped as:

(?:lions)|(?:starz)|(?:summit)

and not as something like:

lion(?:s|s)tar(?:z|s)ummit

Is there a rule about alternation that describes 'how far back' it should go or how is that determined?


Solution

  • The pipe symbol (|) is a special character: it cannot appear literally in the text you're trying to match unless you escape it.

    The regex engine reads ordinary characters as literals until it reaches a special one, then decides what to do from there.

    In your case, it reads the literal lions, then finds a special |, which tells it that an alternative follows: starz. Another | then introduces the next alternative, summit.

    Once it has read the whole pattern, it knows it needs to match any one of lions, starz, or summit.

    The reason it doesn't mistake it for something like lion(?:s|s)tar(?:z|s)ummit is simply that you didn't write any grouping characters such as ( and ), so nothing restricts the alternation to a smaller part of the pattern.

    As for "how far back" it goes: alternation has the lowest precedence of the regex operators, so each alternative stretches back to the previous | or to the start of the enclosing group (or of the whole pattern), and forward to the next | or the end of that group. You could have hundreds of alternatives and it would still match any of them.

    If you wanted to match text containing a literal pipe symbol instead, you'd escape it with a backslash (\), as in before\|after.
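
A quick way to check the grouping behaviour described above is to compare the patterns side by side. The sketch below uses Python's re module as a stand-in for PCRE (an assumption; the two agree on this syntax), and the word list and the full_matches helper are made up for illustration:

```python
import re

# Top-level alternation: each alternative is a whole branch of the pattern.
top_level = re.compile(r"lions|starz|summit")

# The same alternatives written with explicit non-capturing groups.
explicit = re.compile(r"(?:lions)|(?:starz)|(?:summit)")

# Alternation confined by groups: only the grouped characters are alternatives.
confined = re.compile(r"lion(?:s|s)tar(?:z|s)ummit")

# An escaped pipe is a literal character, not an alternation operator.
literal_pipe = re.compile(r"before\|after")

# Hypothetical test strings, chosen only to tell the patterns apart.
words = ["lions", "starz", "summit", "lionstarzummit", "before|after", "before"]


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if pattern.fullmatch(w)]


top_hits = full_matches(top_level, words)
explicit_hits = full_matches(explicit, words)
confined_hits = full_matches(confined, words)
pipe_hits = full_matches(literal_pipe, words)

print(top_hits)       # ['lions', 'starz', 'summit']
print(explicit_hits)  # ['lions', 'starz', 'summit']
print(confined_hits)  # ['lionstarzummit']
print(pipe_hits)      # ['before|after']
```

The first two patterns accept exactly the same strings, which is the grouping the question asks about. Only the version with explicit groups around s|s and z|s treats the surrounding characters as shared text.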