In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?
For example, take the following PCRE regex:
^(?:(z)?(?(1)aa|a)){2}
When fed the string zaazaa
, it matches the whole string, as desired. But when fed zaaaa
, I would like it to match zaaa
; instead, it matches zaaaa
, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2}
but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)
An alternative way of doing this, which also doesn't work as desired:
^(?:(?:z()|())(?:\1aa|\2a)){2}
Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:
^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))
So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z
is non-empty:
^(?:(z?)(?(?!.*$\1)aa|a)){2}
Or just using an emulated conditional:
^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}
I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).
It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *
, ?
, or {0,N}
quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:
\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b
But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1
on every iteration of the loop:
^(?:(?=a|(b)).(?(1)_))*$
and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa
. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_
.
I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*)
, i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.
To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.
Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.
This is partially possible in .NET's flavour of regex.
The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*)
records each character in the first line as a separate capture in the group.
To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression)
.
name2
must have previously been captured.name1
is present, the substring between the end of the capture of name2
and the start of the subexpression match is captured into name1
.name2
is then deleted. (This means that the old value could be backreferenced in the subexpression.)If you know you have name2
captured exactly once then it can readily be deleted using (?<-name2>)
; if you don't know whether you have name2
captured then you could use (?>(?<-name2>)?)
or a conditional. The problem arises if you might have name2
captured more than once since then it depends on whether you can organise enough repetitions of the deletion of name2
. ((?<-name2>)*
doesn't work because *
is equivalent to ?
for zero-length matches.)