php regex regex-alternation

Confusion Within an Alternation


I assumed that within a regex, once one alternative of an alternation matches, the engine stops right there, even if more alternatives are left (there are no other tokens in the regex outside the alternation).

Source

This pattern searches for a doubled word (e.g., this this):

\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)
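
For reference, here is a minimal sketch of how the pattern could be run in PHP. The `/` delimiters and the helper name `findDoubledWord` are assumptions made for illustration; they are not part of the original pattern.

```php
<?php
// The pattern from the question, wrapped in '/' delimiters for preg_match().
$pattern = '/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/';

// Hypothetical helper: returns the doubled word, or null if none is found.
function findDoubledWord(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

var_dump(findDoubledWord($pattern, 'this this'));        // string(4) "this"
var_dump(findDoubledWord($pattern, 'this <b>this</b>')); // string(4) "this"
var_dump(findDoubledWord($pattern, 'this that'));        // NULL
```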

What confuses me is that the following subject matches the pattern:

"<i>whatever<i>         whatever"

\b([a-z]+) matches the first word.

((?:\s|<[^>]+>)+) A tag follows, so the second alternative matches.

(\1\b) should match only if what follows is the same word, backreferenced from the first parentheses.

Why does the pattern match when the tag is followed by whitespace rather than by what (\1\b) expects?

I know that \s is one of the alternatives in the alternation.

But isn't the tag match supposed to use up the alternation?

Why is the \s alternative still alive?


Solution

  • The alternation is controlled by a + quantifier:

    (?:\s|<[^>]+>)+
    

    ...so it tries to match multiple times. Each time, it may try both alternatives: first \s, and if that fails, <[^>]+>.

    The first time, \s fails to match, but <[^>]+> succeeds in matching <i>.

    The second time, \s matches one space.

    The third time, \s matches another space.

    ...and so on, until all the spaces are consumed. Only then does the engine move on to (\1\b), which matches the second whatever.
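
The passes described above can be made visible with a short PHP sketch. It is a reconstruction rather than a real engine trace: it inspects what the second capture group consumed and then splits that text back into one piece per pass. The `/` delimiters, the variable names, and the choice of nine spaces in the subject are assumptions for illustration.

```php
<?php
$pattern = '/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/';

// Subject from the question; the exact number of spaces is not important.
$subject = '<i>whatever<i>' . str_repeat(' ', 9) . 'whatever';

preg_match($pattern, $subject, $m);

// Group 2 holds everything the quantified alternation consumed:
// the tag AND all the spaces after it, not the tag alone.
var_dump($m[2]);

// Each pass of the + loop consumes exactly one alternative, so splitting
// group 2 with the bare alternation recovers one piece per pass.
preg_match_all('/\s|<[^>]+>/', $m[2], $passes);

foreach ($passes[0] as $i => $piece) {
    // A piece starting with '<' came from the tag alternative.
    $which = $piece[0] === '<' ? '<[^>]+>' : '\s';
    printf("pass %d: %s matched %s\n", $i + 1, $which, json_encode($piece));
}
```

The first pass is taken by the tag alternative and every later pass by \s, which is why the whitespace after the tag does not prevent the overall match.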