I have a regular expression that captures a pattern A only if the string contains a pattern B somewhere before A.
Let's say, for the sake of simplicity, that A is \b\d{3}\b (i.e. three digits) and B is the word "foo".
Therefore the regex I have is (?<=\b(?:foo)\b.*?)(?<A>\b\d{3}\b):
(?<= # look-behind
\b(?:foo)\b # pattern B
.*? # variable length
)
(?<A>\b\d{3}\b) # pattern A
For example, for the string
"foo text 111, 222 and not bar something 333 but foo 444 and better 555"
it captures
(111, 222, 333, 444, 555)
I got a new requirement: I now have to exclude the captures that are preceded by pattern C; let's say that C is the word "bar". What I want to build is a regex that expresses
(?<= # look-behind
\b(?:foo)\b # pattern B
??????????? # anything that does not contain pattern C
)
(?<A>\b\d{3}\b) # pattern A
So, in the example string I will have to capture
(111, 222, 444, 555)
Of course something like (?<=\b(?:foo)\b.*?)(?<!\b(?:bar)\b.*?)(?<A>\b\d{3}\b)
(?<= # look-behind
\b(?:foo)\b # pattern B
.*?
)
(?<! # negative look-behind
\b(?:bar)\b # pattern C
.*?
)
(?<A>\b\d{3}\b) # pattern A
will not work as it will exclude everything after the first appearance of "bar" and the capture will be
(111, 222)
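To make that failure concrete, here is a minimal C# sketch that runs the two-lookbehind pattern against the test string. The class and method names (NaiveExclusionDemo, FindA) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveExclusionDemo
{
    // Hypothetical helper: returns the value of group A for every match.
    public static string[] FindA(string input)
    {
        // Positive lookbehind for "foo", then a negative lookbehind for "bar"
        // anywhere earlier on the line.
        var regex = new Regex(@"(?<=\b(?:foo)\b.*?)(?<!\b(?:bar)\b.*?)(?<A>\b\d{3}\b)");
        return regex.Matches(input)
                    .Cast<Match>()
                    .Select(m => m.Groups["A"].Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string input =
            "foo text 111, 222 and not bar something 333 but foo 444 and better 555";
        // Every number after the first "bar" is rejected, including 444 and 555.
        Console.WriteLine(string.Join(", ", FindA(input))); // prints: 111, 222
    }
}
```

The negative lookbehind has no notion of "since the last foo", so a single "bar" poisons the rest of the line.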
The regex (?<=\b(?:foo)\b(?!.*?(?:\bbar\b)).*?)(?<A>\b\d{3}\b)
(?<= # look-behind
\b(?:foo)\b # pattern B
(?! # negative lookahead
.*? # variable length
(?:\bbar\b) # pattern C
)
.*? # variable length
)
(?<A>\b\d{3}\b) # pattern A
will not work either because for the first "foo" in my test string, it will always find the "bar" as a suffix and it will only capture
(444, 555)
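The same kind of check for this second attempt, again as a sketch with hypothetical names (LookaheadInLookbehindDemo, FindA). It shows that the lookahead placed after "foo" scans all the way to the end of the line, not just up to A.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadInLookbehindDemo
{
    // Hypothetical helper: returns the value of group A for every match.
    public static string[] FindA(string input)
    {
        // The negative lookahead after "foo" rejects any "foo" that has a "bar"
        // anywhere to its right, even when that "bar" comes after A.
        var regex = new Regex(@"(?<=\b(?:foo)\b(?!.*?(?:\bbar\b)).*?)(?<A>\b\d{3}\b)");
        return regex.Matches(input)
                    .Cast<Match>()
                    .Select(m => m.Groups["A"].Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string input =
            "foo text 111, 222 and not bar something 333 but foo 444 and better 555";
        // The first "foo" is disqualified, so 111 and 222 are lost along with 333.
        Console.WriteLine(string.Join(", ", FindA(input))); // prints: 444, 555
    }
}
```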
So far, using conditional matching and (now) knowing that inside a lookbehind .NET matches and captures from right to left, I was able to create the following regex: (?<=(?(C)(?!)| (?:\bfoo\b))(?:(?<!\bbar)\s|(?<C>\bbar\s)|[^\s])*)(?<A>\b\d{3}\b)
(?<= # look-behind
(?(C) # if capture group C is not empty
(?!) # fail (pattern C was found)
| # else
(?:\bfoo\b) # pattern B
)
(?:
(?<!\bbar)\s # space not preceded by pattern C (consume the space)
|
(?<C>\bbar\s) # pattern C followed by space (capture in capture group C)
|
[^\s] # anything but space (just consume)
)* # repeat as needed
)
(?<A>\b\d{3}\b) # pattern A
which works but is too complex, as the patterns A, B and C are a lot more complex than the examples I have posted here.
Is it possible to simplify this regex? Maybe using balancing groups?
One simple option is very similar to Casimir et Hippolyte's second pattern:
foo(?>(?<A>\b\d{3}\b)|(?!bar).)+
foo(?>…|(?!bar).)+ - Stop matching if you've seen bar.
(?<A>\b\d{3}\b) - and capture all A's that you see along the way.
(?>) isn't necessary in this case; backtracking wouldn't mess this up either way.
Similarly, it can be converted to a lookbehind:
(?<=foo(?:(?!bar).)*?)(?<A>\b\d{3}\b)
This has the benefit of matching only the numbers. The lookbehind asserts that there is a foo before A, with no bar in between.
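For comparison, here is a hedged C# sketch that runs both patterns from this answer on the question's test string. The class and method names (AnswerPatternsDemo, ViaCaptures, ViaLookbehind) are hypothetical. The first variant relies on .NET keeping every capture of a repeated group in Group.Captures. The second simply collects the matches.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AnswerPatternsDemo
{
    // Variant 1: each match starts at a "foo" and runs until "bar" (or end of line);
    // every A seen along the way is stored in the capture collection of group A.
    public static string[] ViaCaptures(string input)
    {
        var regex = new Regex(@"foo(?>(?<A>\b\d{3}\b)|(?!bar).)+");
        return regex.Matches(input)
                    .Cast<Match>()
                    .SelectMany(m => m.Groups["A"].Captures.Cast<Capture>())
                    .Select(c => c.Value)
                    .ToArray();
    }

    // Variant 2: each match is a single A; the lookbehind requires a "foo"
    // before it such that no "bar" starts between that "foo" and A.
    public static string[] ViaLookbehind(string input)
    {
        var regex = new Regex(@"(?<=foo(?:(?!bar).)*?)(?<A>\b\d{3}\b)");
        return regex.Matches(input)
                    .Cast<Match>()
                    .Select(m => m.Groups["A"].Value)
                    .ToArray();
    }

    public static void Main()
    {
        const string input =
            "foo text 111, 222 and not bar something 333 but foo 444 and better 555";
        Console.WriteLine(string.Join(", ", ViaCaptures(input)));   // prints: 111, 222, 444, 555
        Console.WriteLine(string.Join(", ", ViaLookbehind(input))); // prints: 111, 222, 444, 555
    }
}
```

With the first variant, the match values themselves are whole stretches of text ("foo text 111, 222 and not "), so the numbers have to be read from the captures. The second variant is usually easier to drop into existing code that expects one match per number.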
Both of these assume B and C are somewhat simple.