regex, lookbehind, negative-lookbehind

Regex: Difference between negative lookbehind and negation


From regular-expressions.info:

\b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to Jon's, the former will match Jon and the latter Jon' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s). The latter will also not match single-letter words like "a" or "I".

Can you explain why?

Also, can you make clear what exactly \b does, and why it matches between the apostrophe and the s?


Solution

  • \b is a zero-width assertion that means word boundary. These character positions (taken from that link) are considered word boundaries:

    • Before the first character in the string, if the first character is a word character.
    • After the last character in the string, if the last character is a word character.
    • Between two characters in the string, where one is a word character and the other is not a word character.

    Word characters are whatever \w matches (letters, digits, and the underscore). s is a word character, but ' is not. In the above example, the position between the ' and the s is therefore a word boundary.

    The string "Jon's" looks like this if I highlight the anchors and boundaries (the first and last \b occur in the same positions as ^ and $): ^Jon\b'\bs$
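
To check those boundary positions yourself, here is a minimal sketch using Python's re module (the choice of Python is an assumption; the question does not name a regex flavor). The names text, boundaries, and marked are only for illustration.

```python
import re

text = "Jon's"

# Collect the index of every position where the zero-width \b assertion succeeds.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 3, 4, 5]

# Rebuild the string with a "|" marker at each boundary position.
marked = "".join(
    ("|" if i in boundaries else "") + ch for i, ch in enumerate(text)
) + ("|" if len(text) in boundaries else "")
print(marked)  # |Jon|'|s|
```

Positions 0 and 5 coincide with the start and end of the string; positions 3 and 4 sit on either side of the apostrophe.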

    The negative lookbehind assertion (?<!s)\b means it will only match a word boundary if it's not preceded by the letter s (i.e. the last word character is not an s). So it looks for a word boundary under a certain condition.

    Therefore the first regex works like this:

    1. \b\w+ matches the first three letters J o n.

    2. There's actually another word boundary between n and ' as shown above, so (?<!s)\b matches this word boundary because it's preceded by an n, not an s.

    3. Since the end of the pattern has been reached, the resultant match is Jon.
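
The three steps above can be reproduced with a short sketch, again assuming Python's re module; the variable names are illustrative only.

```python
import re

# Negative lookbehind: the final \b must not be preceded by an "s".
lookbehind = re.compile(r"\b\w+(?<!s)\b")

match = lookbehind.search("Jon's")
print(match.group())  # Jon
print(match.span())   # (0, 3)
```

The match stops at index 3, the boundary between n and the apostrophe. The lookbehind consumes no characters; it only inspects the n just before that boundary.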

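    The negated character class in [^s]\b matches any single character that is not the letter s, followed by a word boundary. Unlike the lookbehind, which only inspects the previous character without consuming anything, [^s] must consume one character of its own before the boundary. That is also why the second regex cannot match single-letter words like "a" or "I": \w+ needs at least one character and [^s] needs another.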

    Therefore the second regex works like this:

    1. \b\w+ matches the first three letters J o n.

    2. Since the ' is not the letter s (it fulfills the character class [^s]), and it's followed by a word boundary (between ' and s), it's matched.

    3. Since the end of the pattern has been reached, the resultant match is Jon'. The letter s is not part of the match because the pattern ends at the word boundary between ' and s, so the regex stops there.
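
For comparison, a small sketch of the second regex under the same assumption (Python's re module, illustrative names). It also checks the single-letter case mentioned in the quote.

```python
import re

# Negated character class: [^s] consumes one non-"s" character before \b.
negated = re.compile(r"\b\w+[^s]\b")
# The lookbehind version, for comparison; it consumes nothing extra.
lookbehind = re.compile(r"\b\w+(?<!s)\b")

negated_match = negated.search("Jon's")
print(negated_match.group())  # Jon'

# Single-letter words: \w+ takes the only letter, leaving nothing for [^s].
single_letter = {
    word: (negated.search(word), lookbehind.search(word).group())
    for word in ("a", "I")
}
print(single_letter)  # {'a': (None, 'a'), 'I': (None, 'I')}
```

The negated class returns no match for a lone "a" or "I", while the lookbehind version matches both.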