regex pcre2

Expression steals from previously possessively matched characters


When using PCRE2 on regex101, the pattern \G(.*?(?:<[^>]*>|&[^;]*;)*+)*?(micro) applied to the string "<test><micro>" first has group 1 match the entire string (as expected), but (micro) then goes back and steals characters from the content that was previously matched possessively. Is this expected behavior? If so, why does it happen and how do I avoid it?

What I tried:

  • Putting the possessive modifier in different places (the .* at the start needs to be able to surrender characters)
  • Changing the regex to \G(.*?(?:<[^>]*>|&[^;]*;)*+)*?(micro)?, so that (micro) does not need to be matched. This succeeded in the test case above, but when run against the string "micro" it failed to capture the string in group 2

What I expect:

The regex should match everything up to a "micro" that is inside neither <> nor &; and capture it in group 1, then capture the micro in group 2.


Solution

  • You could use

    (?:<[^>]*>|&[^; ]*;)(*SKIP)(*F)|micro
    

    The pattern matches:

    • (?: Non-capturing group for the alternatives
      • <[^>]*> Match from < up to the next >
      • | Or
      • &[^; ]*; Match &, then zero or more characters other than ; or a space, then match ;
    • ) Close the non-capturing group
    • (*SKIP)(*F) Force the match to fail here and skip past the text consumed so far, so the engine does not retry inside it
    • | Or
    • micro Match literally

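    For engines without backtracking control verbs, a similar effect can be sketched with plain alternation: consume the <...> and &...; parts first, and capture micro only when it shows up outside them. The Python sketch below uses the standard re module, which does not support (*SKIP)(*F), so this is a swapped-in technique rather than the PCRE2 pattern itself. The names TOKEN and find_micro are hypothetical, chosen only for illustration.

    ```python
    import re

    # Tags and entities come first in the alternation, so they are consumed
    # whole; group 1 is set only when "micro" is found outside of them.
    TOKEN = re.compile(r"<[^>]*>|&[^; ]*;|(micro)")


    def find_micro(text):
        """Return (prefix, "micro") for the first "micro" that is not inside
        <...> or &...;, or None if there is no such occurrence."""
        for m in TOKEN.finditer(text):
            if m.group(1) is not None:
                # Everything before the captured "micro" plays the role of
                # group 1 in the question; the capture plays the role of group 2.
                return text[:m.start(1)], m.group(1)
        return None


    if __name__ == "__main__":
        print(find_micro("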