regex perl negative-lookbehind

Negative lookbehind in regex


(Note: not a duplicate of Why can't you use repetition quantifiers in zero-width look behind assertions; see end of post.)

I'm trying to write a grep -P (Perl) regex that matches B, when it is not preceded by A -- regardless of whether there is intervening whitespace.

So, I tried this negative lookbehind, and tested it in regex101.com:

(?<!A)\s*B

This causes "AB" not to be matched, which is good, but "A B" does result in a match, which is not what I want.

I am not exactly sure why this is. It has something to do with the fact that \s* can match the empty string "", so you could say there are infinitely many empty matches of \s* between A and B. But why does this affect "A B" but not "AB"?

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

I posted this before and it was incorrectly marked as a duplicate question. The variable-length thing I'm looking for is part of the match, not part of the negative lookbehind itself -- so this is quite different from the other question. Yes, I could put the \s* inside the negative lookbehind, but I haven't done so (and doing so is not supported, as the other question explains). Also, I am particularly interested in why the alternate regex I posted above works, since I know it works but I'm not exactly sure why. The other question did not help answer that.


Solution

  • But why does this affect "A B" but not "AB"?

    A regex match is always attempted at a position, and it helps to think of positions as lying between characters. In "A B" there is a position (after the space and before the B) where (?<!A) succeeds (because the character immediately preceding is a space, not an A), and \s*B succeeds (\s* matches the empty string, and B matches B), so the entire pattern succeeds.

    In "AB" there is no such position. The only place where \s*B can match (immediately before the B), is also immediately after the A, so (?<!A) cannot succeed. There are no positions that satisfy both, so the pattern as a whole can't succeed.
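To make the position-by-position reasoning concrete, here is a minimal sketch in Python rather than grep -P. It assumes Python's re module treats this particular pattern the same way PCRE does; the helper name try_each_position is made up for illustration. It attempts the question's pattern at every start position and records what, if anything, matches there.

```python
import re

# The pattern from the question: lookbehind, optional whitespace, then B.
pattern = re.compile(r"(?<!A)\s*B")

def try_each_position(text):
    """Map each start position to the text matched there, or None."""
    results = {}
    for pos in range(len(text) + 1):
        # match(text, pos) anchors the attempt at pos, but the lookbehind
        # can still inspect the characters that come before pos.
        m = pattern.match(text, pos)
        results[pos] = m.group(0) if m else None
    return results

print(try_each_position("AB"))   # no position succeeds
print(try_each_position("A B"))  # only position 2 (just before the B) succeeds
```

In "A B", position 1 is rejected by the lookbehind, but position 2 is preceded by a space rather than an A, so the engine matches just "B" from there.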

    Is the following regex a proper solution, and if so, why exactly does it fix the problem?

    (?<![A\s])\s*B

    This works because (?<![A\s]) will not succeed immediately after an A or immediately after a whitespace character. So now the lookbehind forbids any match position that has whitespace directly before it. If there is any whitespace before the B, it has to be consumed by the \s* portion of the pattern, and the match position must be before all of it. If that position also doesn't have an A before it, the lookbehind can succeed and the pattern as a whole can match.

    This is a trick that's made possible by the fact that \s is a fixed-width (single-character) pattern, and that every position inside or at the end of a non-empty \s* match is immediately preceded by something \s matches. It can't be extended to the general case of an arbitrary pattern between the (non-)A and the B.
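The two patterns from the question can also be compared side by side. This is a sketch under the same assumption as before (Python's re standing in for grep -P); the names naive, fixed, and matched_text are illustrative only.

```python
import re

naive = re.compile(r"(?<!A)\s*B")      # first attempt from the question
fixed = re.compile(r"(?<![A\s])\s*B")  # lookbehind also rejects whitespace

def matched_text(regex, text):
    """Return the leftmost match of regex in text, or None if there is none."""
    m = regex.search(text)
    return m.group(0) if m else None

for text in ["AB", "A B", "A   B", "C B", " B", "B"]:
    print(repr(text), repr(matched_text(naive, text)), repr(matched_text(fixed, text)))
```

For "A B" the first pattern returns "B" while the second returns no match. For "C B" both return " B", because the only position the second lookbehind accepts is the one before the whitespace, and \s* then has to consume that whitespace to reach the B.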