When using PCRE2 on regex101, group 1 in \G(.*?(?:<[^>]*>|&[^;]*;)*+)*?(micro)
used on the string "<test><micro>" first matches the entire string (as expected), but (micro) then goes back and steals from the previously possessively matched content. Is this expected behavior? If so, why does it happen, and how do I avoid it?
I also tried
\G(.*?(?:<[^>]*>|&[^;]*;)*+)*?(micro)?
so that (micro) does not need to be matched. This succeeded in the test case above, but when run against the string "micro", it failed to capture the string in group 2.
The regex should match everything up to a "micro" that is neither in <> nor in &;, capture that in group 1, and then capture the micro in group 2.
You could use
(?:<[^>]*>|&[^; ]*;)(*SKIP)(*F)|micro
The pattern matches:
(?:            Non-capture group for the alternatives
<[^>]*>        Match from <...>
|              Or
&[^; ]*;       Match & and then optional chars other than ; or space, and then match ;
)              Close the non-capture group
(*SKIP)(*F)    Skip the match
|              Or
micro          Match literally
See a regex demo
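If the pattern has to run somewhere without backtracking control verbs, the same idea can be approximated with a plain alternation and a capture group. The sketch below assumes Python's standard `re` module, which does not support (*SKIP)(*F); this is a swapped-in technique, not the PCRE2 pattern itself. Tags and entities are consumed by the first alternatives and ignored, and only a bare "micro" fills group 1. The helper name `find_bare_micro` is made up for illustration.

```python
import re

# Stand-in for (*SKIP)(*F): tags and entities are matched first and
# ignored; only a "micro" outside them is captured in group 1.
PATTERN = re.compile(r"<[^>]*>|&[^; ]*;|(micro)")


def find_bare_micro(text):
    """Return (prefix, start) for the first "micro" outside <...> and &...;.

    prefix is everything before that "micro"; returns None if there is none.
    (Hypothetical helper, not part of any library.)
    """
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            return text[:m.start(1)], m.start(1)
    return None


print(find_bare_micro("<test><micro>"))             # micro is inside a tag
print(find_bare_micro("micro"))                     # bare micro at offset 0
print(find_bare_micro("<micro>&micro; a micro b"))  # only the last micro counts
```

The returned prefix plays the role of group 1 in the question, and the "micro" at the returned offset plays the role of group 2.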