A Regular Expression with a Conditional in Lookbehind

This is using the .NET regex engine.

I am attempting to use a conditional inside of a lookbehind clause. When the expression is used outside of the lookbehind it behaves as I expect -- but when placed in the lookbehind, it has a different behavior.

Here is a simple example that reproduces the issue.

Matching:

good morning

with the regular expression:

(?<=(?(?=go)good|bad)\s)morning

yields no match.

When tried without the lookbehind:

(?(?=go)good|bad)\smorning

I get a match on "good morning"

By fiddling around, I discovered that the lookahead cursor location, when it is inside the lookbehind, is after the word "good":

(?<=(?(?=\smor)good|bad)\s)morning

This matches "morning".

My question: is this expected behavior, or some kind of bug?

Obviously this example is not real world - the problem that I was trying to solve when I stumbled on this issue is as follows: The expression uses a conditional to determine the length of the next word, then uses two different sets of rules for matching on that word. Similar to:

(?<=\s+(?(?=[^\s]{1,2}\s)[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\s+text

This matches "matching text" only if it is preceded by either a one or two character word consisting of letters and numbers, or a longer word that does not consist only of letters but may contain numbers, letters, slashes, dashes, quotes and apostrophes.

The following should match "matching text":

1 matching text
a matching text

It only matches the first one: the conditional evaluated to false (it was looking at " matching" instead of "a"), and the negative lookahead, which rejects a word consisting only of letters, then failed on the "a".

Further examples:

Must match "matching text":

123-1 matching text
9B matching text
15/16 matching text
"45" matching text
A matching text
AA matching text
A1 matching text

Must not match "matching text"

and matching text
" matching text
A- matching text
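
To keep the examples above checkable, here is a rough sketch in Python that encodes the rule as I described it in prose, not the regex itself. The names word_is_accepted and should_match are made up for this illustration, and it assumes case-insensitive matching (since "a" must match) and exactly one word followed by a single space and "matching text".

```python
import string

# Character sets taken from the prose description of the rule.
LETTERS = set(string.ascii_letters)
ALNUM = LETTERS | set(string.digits)
LONG_WORD_CHARS = ALNUM | set("-/\"'")

def word_is_accepted(word: str) -> bool:
    if not word:
        return False
    if len(word) <= 2:
        # One or two character words: letters and digits only.
        return all(ch in ALNUM for ch in word)
    if all(ch in LETTERS for ch in word):
        # Longer words made only of letters are rejected.
        return False
    return all(ch in LONG_WORD_CHARS for ch in word)

def should_match(line: str) -> bool:
    # Assumption: one word, a single space, then "matching text".
    word, sep, rest = line.partition(" ")
    return sep == " " and rest == "matching text" and word_is_accepted(word)

must_match = ["123-1", "9B", "15/16", '"45"', "A", "AA", "A1", "1", "a"]
must_not_match = ["and", '"', "A-"]

for word in must_match:
    assert should_match(word + " matching text"), word
for word in must_not_match:
    assert not should_match(word + " matching text"), word
```

A check like this is only an oracle for the intended behavior; it says nothing about how the .NET engine walks the pattern.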

Solution

I think I understand the issue now. The important difference between a conditional inside of a lookbehind and a conditional not inside a lookbehind is the timing of when the conditional is executed (or where the search cursor is at that point).

The example without a lookbehind:

(?(?=go)good|bad)\smorning

good morning

The conditional is run at the beginning of the search. So the search cursor is before the 'g' in 'good'.

 good morning
^

So at this point the lookahead evaluates to TRUE, since it sees the 'go'.

In the example with the lookbehind, the cursor is at a different location.

(?<=(?(?=go)good|bad)\s)morning

The search cursor finds the first required item in the text: 'morning'

good morning
     ^

The actual search cursor stays in place to consume 'morning' if the lookbehind matches. The lookbehind will use its own cursor to verify what is before 'morning' to determine if this 'morning' is a match. The lookbehind states that there is a '\s' directly before 'morning' and indeed there is. The temporary lookbehind cursor moves to the space:

good morning
    ^^

Now it gets to the conditional and runs the lookahead in the conditional statement. At this point the lookahead is looking for 'go' but it sees ' morning', so the conditional fails. The expression then tries to match 'bad' (or 'dab' backwards from the lookbehind cursor) but it sees 'good' (or 'doog' backwards from the lookbehind cursor). So no match for 'morning'.
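
The steps above can be traced by hand. Python's re module does not support lookaround conditionals, so the following is only a sketch that emulates the cursor positions with plain string operations; the function name is hypothetical, and it is not how the .NET engine is implemented.

```python
def lookbehind_with_lookahead_condition(text: str, start: int) -> bool:
    # Emulates the lookbehind in (?<=(?(?=go)good|bad)\s)morning,
    # evaluated right to left from the main cursor at `start`.
    pos = start
    # Step 1: \s must sit immediately to the left of the main cursor.
    if pos == 0 or not text[pos - 1].isspace():
        return False
    pos -= 1  # the temporary lookbehind cursor moves over the space
    # Step 2: the condition (?=go) looks to the RIGHT of the temporary cursor.
    condition = text.startswith("go", pos)
    # Step 3: the chosen branch must end at the temporary cursor (to its LEFT).
    branch = "good" if condition else "bad"
    return text.endswith(branch, 0, pos)

text = "good morning"
start = text.index("morning")
print(lookbehind_with_lookahead_condition(text, start))  # False
```

The condition is tested against " morning", so the 'bad' branch is chosen and then fails against 'good'.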

Solution

Since the conditional is run at the end of the word of interest when it is in a lookbehind (instead of at the beginning of that word outside of a lookbehind), the secret is to reverse the conditional to a lookbehind instead of a lookahead:

(?<=(?(?<=good)good|bad)\s)morning

This actually matches on morning. It looks nonsensical in this example to look for a word and then match it - but it illustrates the concept. In the real world case stated in the question, the solution looks like this:

(?<=(?(?<=\s\S{1,2})[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\stext

This looks for a word before matching text. The conditional checks to see if it is a one or two character word. If so, it should consist of only letters and numbers. If not, it must not only consist of letters. If this is met, it can consist of letters, numbers, dashes, slashes, quotes and apostrophes.

Two things changed:

Modified the conditional to be a lookbehind instead of a lookahead
I had to move the '\s' from the beginning of the lookbehind into the conditional, because the conditional is being processed before that space and it results in matching on the last one or two characters of the word instead of looking for a one or two character word. That is a tricky issue, because when the expression is not in a lookbehind (for instance if you want the text in the lookbehind to be included in the match), this change messes up the match.
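
To see the first change in isolation, here is the simple 'good morning' pattern again, (?<=(?(?<=good)good|bad)\s)morning, traced by hand in Python. This is a sketch under the same caveat as before: it emulates the cursor positions with string operations, and the function name is made up for the illustration.

```python
def lookbehind_with_lookbehind_condition(text: str, start: int) -> bool:
    # Emulates the lookbehind in (?<=(?(?<=good)good|bad)\s)morning.
    pos = start
    # \s must sit immediately to the left of the main cursor.
    if pos == 0 or not text[pos - 1].isspace():
        return False
    pos -= 1  # temporary lookbehind cursor, now at the end of the word
    # The condition (?<=good) looks to the LEFT of the temporary cursor,
    # which is the same direction the branch itself is matched in.
    condition = text.endswith("good", 0, pos)
    branch = "good" if condition else "bad"
    return text.endswith(branch, 0, pos)

for sample in ["good morning", "bad morning", "sad morning"]:
    start = sample.index("morning")
    print(sample, lookbehind_with_lookbehind_condition(sample, start))
# good morning True
# bad morning True
# sad morning False
```

Because the condition and the branch now look in the same direction from the same cursor, the condition is evaluated against the word of interest.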

What follows is some more analysis of the original problem. Enjoy!

@zx81 had an example that is actually better at illustrating what is going on. This example has more cursor movement so it does help illustrate what is happening:

    (?<=(?(?=go)good|bad))\w+pher

badphilosopher    <-- 'philosopher' matches
goodgopher        <-- 'gopher' matches
badgopher         <-- no match

There is a big difference in this example because the \w+ is used. So the regex engine immediately matches all the text in each example since the phrase has no white space and ends in 'pher'.

So for 'badphilosopher':

The lookbehind runs and the conditional immediately runs, looking for 'go' but finding 'ba'.

badphilosopher
^

The condition failed so it tries to match bad to the left of the cursor, but we are at the beginning of the phrase, so no match.

It checks again at these two cursor points because the '\w+pher' matches each time:

badphilosopher
  ^

But the lookbehind sees 'b' then 'ba'

When the cursor gets to:

badphilosopher
   ^

The conditional again fails to find a 'go' (it sees 'ph') so it attempts to match 'bad' to the left of the cursor and finds it! Therefore the \w+pher matches the 'philosopher' text.

goodgopher
    ^

goodgopher matches in a similar way except that the conditional is successful.

badgopher
   ^

badgopher does not match because the conditional is successful but 'good' is not found to the left of the cursor.
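
The three walkthroughs above can be reproduced with a small hand emulation in Python. This is a sketch rather than the real engine: the lookbehind is checked with string operations at each cursor position, only the \w+pher part uses Python's re module, and the helper names are hypothetical.

```python
import re

TAIL = re.compile(r"\w+pher")

def lookbehind_ok(text: str, pos: int) -> bool:
    # (?=go): the condition looks to the RIGHT of the cursor.
    condition = text.startswith("go", pos)
    # The chosen branch must end exactly at the cursor, i.e. sit to its LEFT.
    branch = "good" if condition else "bad"
    return text.endswith(branch, 0, pos)

def find_match(text: str):
    # Emulates (?<=(?(?=go)good|bad))\w+pher, trying each cursor position
    # from left to right.
    for pos in range(len(text) + 1):
        if lookbehind_ok(text, pos):
            m = TAIL.match(text, pos)
            if m:
                return m.group()
    return None

print(find_match("badphilosopher"))  # philosopher
print(find_match("goodgopher"))      # gopher
print(find_match("badgopher"))       # None
```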

Putting a space in there really changes things up, because the \w+pher no longer matches the entire string.

    (?<=(?(?=go)good|bad)\s+)\w+pher

bad philosopher    <-- matches philosopher
good gopher        <-- no match
bad gopher         <-- matches gopher

In this case the cursor moves through the string until it can match \w+pher:

bad philosopher
    ^

At this point it starts to process the lookbehind -- and sees that a '\s+' is required to the left of the search cursor -- it finds this and moves the temporary lookbehind cursor over.

bad philosopher
   ^^

The conditional now runs, looking for 'go' at the temporary lookbehind cursor, but it finds ' p'. The failure means trying to match 'bad' to the left of the temporary lookbehind cursor, and indeed it finds it there.

good gopher
    ^^

The 'good gopher' example gets to the conditional and sees ' g', so the conditional fails; it then looks for 'bad' to the left of the cursor and doesn't find it. So there is no match.

bad gopher
   ^^

Similarly, 'bad gopher' gets to the conditional and finds ' g', then looks for 'bad' to the left of the cursor and finds it. So it matches.

When run without the lookbehind, all of these examples match. This can be rather counterintuitive - but you have to take the location of the cursors into account in the lookbehind.
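
The contrast between the two forms can be sketched in Python by emulating where each one evaluates the condition. As before, this is a hand emulation with made-up function names, not the .NET engine: the conditional is replaced by string checks, and only the remaining part of each pattern goes through Python's re module.

```python
import re

def with_lookbehind(text: str):
    # Emulates (?<=(?(?=go)good|bad)\s+)\w+pher
    tail = re.compile(r"\w+pher")
    for start in range(len(text) + 1):
        pos = start
        # \s+ is consumed to the LEFT of the main cursor, one character at a time.
        while pos > 0 and text[pos - 1].isspace():
            pos -= 1
            # The condition sees what is to the RIGHT of the temporary cursor.
            branch = "good" if text.startswith("go", pos) else "bad"
            if text.endswith(branch, 0, pos):
                m = tail.match(text, start)
                if m:
                    return m.group()
    return None

def without_lookbehind(text: str):
    # Emulates (?(?=go)good|bad)\s+\w+pher
    tail = re.compile(r"\s+\w+pher")
    for start in range(len(text) + 1):
        # Here the condition is checked at the START of the first word.
        branch = "good" if text.startswith("go", start) else "bad"
        if text.startswith(branch, start):
            m = tail.match(text, start + len(branch))
            if m:
                return text[start:m.end()]
    return None

for sample in ["bad philosopher", "good gopher", "bad gopher"]:
    print(sample, "|", with_lookbehind(sample), "|", without_lookbehind(sample))
# bad philosopher | philosopher | bad philosopher
# good gopher | None | good gopher
# bad gopher | gopher | bad gopher
```

The only difference between the two functions that matters here is the position at which "go" is tested.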