From regular-expressions.info:
\b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to Jon's, the former will match Jon and the latter Jon' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s). The latter will also not match single-letter words like "a" or "I".
Can you explain why?
Also, can you make clear what exactly \b does, and why it matches between the apostrophe and the s?
\b is a zero-width assertion that means word boundary. These character positions (taken from that link) are considered word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are the characters matched by \w. s is a word character, but ' is not. In the above example, the position between the ' and the s is a word boundary.
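The three rules above can be checked by hand. The following is a minimal sketch in Python, not part of the quoted tutorial: the helper names is_word_char and boundary_positions are made up for illustration, and the word-character test assumes ASCII-only input (letters, digits and underscore). It compares the hand-rolled rule against the positions where the re module reports \b.

```python
import re

def is_word_char(ch):
    # Assumption: ASCII-only \w, i.e. letters, digits and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def boundary_positions(text):
    """Return every index i where a word boundary lies between text[i-1] and text[i]."""
    positions = []
    for i in range(len(text) + 1):
        # Anything outside the string counts as "not a word character".
        before = i > 0 and is_word_char(text[i - 1])
        after = i < len(text) and is_word_char(text[i])
        # A boundary exists when exactly one side is a word character.
        if before != after:
            positions.append(i)
    return positions

text = "Jon's"
manual = boundary_positions(text)
from_regex = [m.start() for m in re.finditer(r"\b", text)]
print(manual)      # [0, 3, 4, 5]
print(from_regex)  # [0, 3, 4, 5]
```

Index 4 is the position between the apostrophe and the s, which is the boundary the hint refers to.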
The string "Jon's" looks like this if I highlight the anchors and boundaries (the first and last word boundaries occur in the same positions as ^ and $): ^Jon\b'\bs$
The negative lookbehind assertion (?<!s)\b means it will only match a word boundary if it's not preceded by the letter s (i.e. the last word character is not an s). So it looks for a word boundary under a certain condition.
Therefore the first regex works like this:
- \b\w+ matches the first three letters J, o and n.
- There's actually another word boundary between n and ' as shown above, so (?<!s)\b matches this word boundary because it's preceded by an n, not an s.
- Since the end of the pattern has been reached, the resultant match is Jon.
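As a quick check of the walkthrough for the first regex, here is a small sketch using Python's re module (the variable name lookbehind is only illustrative). It also tries a one-letter word and a word ending in s, to show that the lookbehind adds a condition without consuming a character.

```python
import re

lookbehind = re.compile(r"\b\w+(?<!s)\b")

# The match stops at the boundary between n and the apostrophe.
m = lookbehind.search("Jon's")
print(m.group(), m.span())  # Jon (0, 3)

# The lookbehind consumes nothing, so a one-letter word can still match.
print(lookbehind.search("a").group())  # a

# A word ending in s is rejected: the only boundary after the word
# is preceded by s, and backtracking finds no boundary inside the word.
print(lookbehind.search("cats"))  # None
```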
The negated character class in [^s]\b matches any one character that is not the letter s, followed by a word boundary. Unlike the lookbehind above, this consumes one character and then requires a word boundary after it.
Therefore the second regex works like this:
- \b\w+ matches the first three letters J, o and n.
- Since the ' is not the letter s (it fulfills the character class [^s]), and it's followed by a word boundary (between ' and s), it's matched.
- Since the end of the pattern has been reached, the resultant match is Jon'. The letter s is not part of the match: the pattern ends at the word boundary before it, and \b itself consumes no characters.
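The same kind of check for the second regex, again as a sketch with Python's re module (the name negated is only illustrative). It also covers the single-letter case from the quoted passage: because [^s] has to consume a character of its own after \w+ has taken at least one, the pattern needs at least two characters to match.

```python
import re

negated = re.compile(r"\b\w+[^s]\b")

# [^s] consumes the apostrophe, and \b then matches between ' and s.
m = negated.search("Jon's")
print(m.group(), m.span())  # Jon' (0, 4)

# On a one-character string, \w+ takes the only character and
# nothing is left for [^s] to consume, so there is no match.
print(negated.search("a"))  # None
print(negated.search("I"))  # None
```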