Tags: .net, regex, lookbehind, regex-lookarounds, balancing-groups

Match regex with variable-length look-behind of a word and variable-length negative look-behind of another word?


I have a regular expression that captures a pattern A only if the string contains a pattern B somewhere before A.

Let's say, for the sake of simplicity, that A is \b\d{3}\b (i.e. three digits) and B is the word "foo".

Therefore the Regex I have is (?<=\b(?:foo)\b.*?)(?<A>\b\d{3}\b).

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    .*?            # variable length
)
(?<A>\b\d{3}\b)    # pattern A

For example, for the string

"foo text 111, 222 and not bar something 333 but foo 444 and better 555"

it captures

(111, 222, 333, 444, 555)
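To reproduce this baseline, here is a minimal C# sketch. It assumes top-level statements (C# 9 or later), and the variable names `input` and `values` are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Test string from the question.
const string input = "foo text 111, 222 and not bar something 333 but foo 444 and better 555";

// Pattern A (three digits) is accepted only when pattern B ("foo") occurs somewhere before it.
var regex = new Regex(@"(?<=\b(?:foo)\b.*?)(?<A>\b\d{3}\b)");

// Collect the value of group A from every match.
string[] values = regex.Matches(input)
    .Cast<Match>()
    .Select(m => m.Groups["A"].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", values)); // 111, 222, 333, 444, 555
```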

I got a new requirement, and now I have to exclude the captures that are preceded by pattern C; let's say that C is the word "bar". What I want to build is a regex that expresses

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    ???????????    # anything that does not contain pattern C
)
(?<A>\b\d{3}\b)    # pattern A

So, in the example string I will have to capture

(111, 222, 444, 555)

Of course something like (?<=\b(?:foo)\b.*?)(?<!\b(?:bar)\b.*?)(?<A>\b\d{3}\b)

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    .*?
)
(?<!               # negative look-behind
    \b(?:bar)\b    # pattern C
    .*?
)
(?<A>\b\d{3}\b)    # pattern A

will not work as it will exclude everything after the first appearance of "bar" and the capture will be

(111, 222)

The regex (?<=\b(?:foo)\b(?!.*?(?:\bbar\b)).*?)(?<A>\b\d{3}\b)

(?<=                     # look-behind
    \b(?:foo)\b          # pattern B
    (?!                  # negative lookahead
        .*?              # variable length
        (?:\bbar\b)      # pattern C
    )
    .*?                  # variable length
)
(?<A>\b\d{3}\b)          # pattern A

will not work either because for the first "foo" in my test string, it will always find the "bar" as a suffix and it will only capture

(444, 555)
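The two failed attempts above can be checked side by side. This is a rough C# sketch, assuming top-level statements (C# 9 or later); `CapturesOfA` is a hypothetical helper introduced only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Test string from the question.
const string input = "foo text 111, 222 and not bar something 333 but foo 444 and better 555";

// Hypothetical helper: returns the value of group A for every match of the pattern.
string[] CapturesOfA(string pattern) =>
    Regex.Matches(input, pattern)
        .Cast<Match>()
        .Select(m => m.Groups["A"].Value)
        .ToArray();

// Attempt 1: a separate negative look-behind rejects everything after the first "bar".
string[] negativeLookbehind =
    CapturesOfA(@"(?<=\b(?:foo)\b.*?)(?<!\b(?:bar)\b.*?)(?<A>\b\d{3}\b)");

// Attempt 2: a negative lookahead placed right after "foo" scans to the end of the string,
// so the first "foo" is always rejected because a "bar" follows it somewhere.
string[] nestedLookahead =
    CapturesOfA(@"(?<=\b(?:foo)\b(?!.*?(?:\bbar\b)).*?)(?<A>\b\d{3}\b)");

Console.WriteLine(string.Join(", ", negativeLookbehind)); // 111, 222
Console.WriteLine(string.Join(", ", nestedLookahead));    // 444, 555
```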

So far, using conditional matching of expressions and (now) knowing that inside a lookbehind .NET matches and captures from right to left, I was able to create the following regex: (?<=(?(C)(?!)|(?:\bfoo\b))(?:(?<!\bbar)\s|(?<C>\bbar\s)|[^\s])*)(?<A>\b\d{3}\b)

(?<=                     # look-behind
    (?(C)                # if capture group C is not empty
        (?!)             # fail (pattern C was found)
        |                # else
        (?:\bfoo\b)      # pattern B
    )
    (?:
        (?<!\bbar)\s     # space not preceded by pattern C (consume the space)
        |
        (?<C>\bbar\s)    # pattern C followed by space (capture in capture group C)
        |
        [^\s]            # anything but space (just consume)
    )*                   # repeat as needed
)
(?<A>\b\d{3}\b)          # pattern A

which works, but it is too complex, since the actual patterns A, B and C are a lot more complex than the examples I have posted here.

Is it possible to simplify this regex? Maybe using balancing groups?


Solution

  • One simple option is very similar to Casimir et Hippolyte's second pattern:

    foo(?>(?<A>\b\d{3}\b)|(?!bar).)+
    
    • Start with foo
    • (?>...|(?!bar).)+ - Keep consuming characters, but stop as soon as bar comes next.
    • (?<A>\b\d{3}\b) - Capture every A seen along the way.
    • The atomic group (?>) isn't necessary in this case; backtracking wouldn't change the result either way.


    Similarly, it can be converted to a lookbehind:

    (?<=foo(?:(?!bar).)*?)(?<A>\b\d{3}\b)
    

    This has the benefit of matching only the numbers. The lookbehind asserts that there is a foo before A with no bar between them.

    Both of these assume B and C are somewhat simple.
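Both patterns from this answer can be tried against the question's test string. The following C# sketch assumes top-level statements (C# 9 or later), and its variable names are illustrative. With the first pattern, each match spans a whole run that starts at a foo, so the individual numbers are read from the group's `Captures` collection rather than from the match value.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Test string from the question.
const string input = "foo text 111, 222 and not bar something 333 but foo 444 and better 555";

// Pattern 1: one match per run starting at "foo"; group A accumulates every number in that run.
var runPattern = new Regex(@"foo(?>(?<A>\b\d{3}\b)|(?!bar).)+");
string[] fromCaptures = runPattern.Matches(input)
    .Cast<Match>()
    .SelectMany(m => m.Groups["A"].Captures.Cast<Capture>())
    .Select(c => c.Value)
    .ToArray();

// Pattern 2: the same idea as a look-behind, so each match is just the number itself.
var lookbehindPattern = new Regex(@"(?<=foo(?:(?!bar).)*?)(?<A>\b\d{3}\b)");
string[] fromLookbehind = lookbehindPattern.Matches(input)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", fromCaptures));   // 111, 222, 444, 555
Console.WriteLine(string.Join(", ", fromLookbehind)); // 111, 222, 444, 555
```

Note that neither pattern puts word boundaries around foo or bar, so with the real B and C the boundaries from the question would need to be added back.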