Search code examples
regexpcreregex-group

Is it possible to erase a capture group that has already matched, making it non-participating?


In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?

For example, take the following PCRE regex:

^(?:(z)?(?(1)aa|a)){2}

When fed the string zaazaa, it matches the whole string, as desired. But when fed zaaaa, I would like it to match zaaa; instead, it matches zaaaa, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2} but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)

An alternative way of doing this, which also doesn't work as desired:

^(?:(?:z()|())(?:\1aa|\2a)){2}

Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:

^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))

So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z is non-empty:

^(?:(z?)(?(?!.*$\1)aa|a)){2}

Or just using an emulated conditional:

^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}

I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).

It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *, ?, or {0,N} quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:

\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b

But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1 on every iteration of the loop:

^(?:(?=a|(b)).(?(1)_))*$

and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_.

I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*), i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.

To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.

Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.


Solution

  • This is partially possible in .NET's flavour of regex.

    The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*) records each character in the first line as a separate capture in the group.

    To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression).

    • First, name2 must have previously been captured.
    • The subexpression must then match.
    • If name1 is present, the substring between the end of the capture of name2 and the start of the subexpression match is captured into name1.
    • The latest capture of name2 is then deleted. (This means that the old value could be backreferenced in the subexpression.)
    • The match is advanced to the end of the subexpression.

    If you know you have name2 captured exactly once then it can readily be deleted using (?<-name2>); if you don't know whether you have name2 captured then you could use (?>(?<-name2>)?) or a conditional. The problem arises if you might have name2 captured more than once since then it depends on whether you can organise enough repetitions of the deletion of name2. ((?<-name2>)* doesn't work because * is equivalent to ? for zero-length matches.)