I'm trying to set up regex for a blacklist and whitelist, flagging blacklisted words and ignoring whitelisted words. Here are the rules:
Blacklist words I want to search for and match if found: BUNNY, GARDEN, HOLE
Whitelist words that are clean and can be ignored even though they contain blacklisted words: WHOLE, GARDENER
I made the following regex using negative lookbehind:
(BUNNY|GARDEN|HOLE)(?<!\bWHOLE\b|\bGARDENER\b)
My silly example string: This whole hole is a wholey mistake in the gardener agardener.
I would expect only the following to be matched: "hole", "wholey", "agardener"
It mostly works, since "whole" doesn't match but "wholey" does and "agardener" is also a match. However, "gardener" matches even though it's in the whitelist. What am I missing?
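In your pattern, the lookbehind is checked right after BUNNY, GARDEN or HOLE is matched. In "gardener", that position is right after "garden", and the text to its left does not end in the whole word GARDENER, so the negative lookbehind does not block the match. Match the whole word first and only then check it. You can use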
\w*(?:BUNNY|GARDEN|HOLE)\w*\b(?<!\bWHOLE|\bGARDENER)
See the regex demo.
A variation without a lookbehind, but with a lookahead:
\b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b
See this regex demo.
Details:
\w* - zero or more word chars
(?:BUNNY|GARDEN|HOLE) - one of the required word parts
\w* - zero or more word chars
\b - a word boundary
(?<!\bWHOLE|\bGARDENER) - a negative lookbehind that fails the match if the whole word situated on the left is WHOLE or GARDENER.
The \b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b pattern matches a word boundary first, then fails the match if the next chars are WHOLE or GARDENER as whole words, and then matches a word with a BUNNY, GARDEN or HOLE substring in it.
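As a quick check, here is a minimal Python sketch that runs the lookahead variant against the sample string from the question. The lookahead form is used because Python's built-in re module rejects a lookbehind whose alternatives have different lengths. The re.IGNORECASE flag is an assumption: the sample text is lowercase while the word lists are uppercase.

```python
import re

# Lookahead variant from the answer. re.IGNORECASE is an assumption,
# since the sample text is lowercase and the word lists are uppercase.
PATTERN = re.compile(
    r"\b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b",
    re.IGNORECASE,
)

text = "This whole hole is a wholey mistake in the gardener agardener."

# No capturing groups in the pattern, so findall returns the full matches.
matches = PATTERN.findall(text)
print(matches)  # ['hole', 'wholey', 'agardener']
```

The whitelisted words "whole" and "gardener" are skipped, while "wholey" and "agardener" are still flagged because they are not exactly equal to a whitelisted word.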
Replace \w with [a-zA-Z] or \p{L} (or [[:alpha:]]) if supported and you need to only match letter words.
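To illustrate what that replacement changes, the sketch below compares \w with [a-zA-Z] in Python. The build_pattern helper and the sample string are hypothetical, made up for this illustration, and re.IGNORECASE is again an assumption. Note that \b is still defined in terms of \w, so with the letter class a token such as "hole_digger" or "2holes" is skipped entirely rather than partially matched.

```python
import re

def build_pattern(word_char):
    # Hypothetical helper: plugs a "word character" class into the
    # lookahead pattern from the answer.
    return re.compile(
        rf"\b(?!(?:WHOLE|GARDENER)\b){word_char}*(?:BUNNY|GARDEN|HOLE){word_char}*\b",
        re.IGNORECASE,
    )

sample = "hole_digger 2holes rabbithole whole"

any_word = build_pattern(r"\w").findall(sample)
letters_only = build_pattern("[a-zA-Z]").findall(sample)

print(any_word)      # ['hole_digger', '2holes', 'rabbithole']
print(letters_only)  # ['rabbithole']
```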