Regex for blacklist and whitelist words


I'm trying to set up regex for a blacklist and whitelist, flagging blacklisted words and ignoring whitelisted words. Here are the rules:

  1. I want to see if a word or phrase on the blacklist exists in the input string.
  2. The blacklist words should be matched regardless of where they appear (full word or as substring).
  3. The whitelist words (i.e. words that are known to be okay even though they contain blacklisted words) must not be matched, but only when they appear as full words.

Blacklist words I want to search for and match if found: BUNNY, GARDEN, HOLE

Whitelist words that are clean and can be ignored even though they contain blacklisted words: WHOLE, GARDENER

I made the following regex using negative lookbehind: (BUNNY|GARDEN|HOLE)(?<!\bWHOLE\b|\bGARDENER\b)

My silly example string: This whole hole is a wholey mistake in the gardener agardener.

I would expect only the following to be matched: "hole" "wholey" "agardener"

It mostly works, since "whole" doesn't match but "wholey" does and "agardener" is also a match. However, "gardener" matches even though it's in the whitelist. What am I missing?


Solution

  • You can use

    \w*(?:BUNNY|GARDEN|HOLE)\w*\b(?<!\bWHOLE|\bGARDENER)
    

    A variation without a lookbehind, but with a lookahead:

    \b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b
    

    Details:

    • \w* - zero or more word chars
    • (?:BUNNY|GARDEN|HOLE) - one of the required word parts
    • \w* - zero or more word chars
    • \b - a word boundary
    • (?<!\bWHOLE|\bGARDENER) - a negative lookbehind that fails the match if the whole word immediately to the left is WHOLE or GARDENER.
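
To see why the question's pattern lets "gardener" through: its lookbehind runs right after GARDEN has been consumed, where the text to the left ends in "garden", so it can never see the full word "gardener". WHOLE happens to work only because HOLE ends that word. The sketch below compares both patterns in Python. Two things are assumptions here, not part of the answer: Python's built-in re module only accepts fixed-width lookbehinds, so each lookbehind is split into two (which keeps the meaning), and re.IGNORECASE is used because the sample string is lowercase.

```python
import re

TEXT = "This whole hole is a wholey mistake in the gardener agardener."

# The question's pattern. (?<!A|B) is rewritten as (?<!A)(?<!B) because
# Python's `re` rejects lookbehind alternatives of different widths.
ORIGINAL = re.compile(
    r"(BUNNY|GARDEN|HOLE)(?<!\bWHOLE\b)(?<!\bGARDENER\b)",
    re.IGNORECASE,
)

# The answer's pattern: consume the whole word first, then look back
# from the END of the word, where WHOLE / GARDENER are fully visible.
FIXED = re.compile(
    r"\w*(?:BUNNY|GARDEN|HOLE)\w*\b(?<!\bWHOLE)(?<!\bGARDENER)",
    re.IGNORECASE,
)

original_hits = [m.group(0) for m in ORIGINAL.finditer(TEXT)]
fixed_hits = [m.group(0) for m in FIXED.finditer(TEXT)]

print(original_hits)  # ['hole', 'hole', 'garden', 'garden']
print(fixed_hits)     # ['hole', 'wholey', 'agardener']
```

The first result shows the original pattern flagging "garden" inside the whitelisted "gardener"; the second matches the expected output from the question.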

    The pattern \b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b matches a word boundary first, then fails the match if the next characters form the whole word WHOLE or GARDENER, and otherwise matches a word that contains BUNNY, GARDEN or HOLE as a substring.
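
The lookahead variant works unchanged in Python's re module, so it is the easier one to build from word lists. The following is a minimal sketch under some assumptions: build_pattern and find_flagged are hypothetical helper names, matching is case-insensitive, and both lists are non-empty.

```python
import re

def build_pattern(blacklist, whitelist):
    """Compile the lookahead variant from two word lists (hypothetical helper).

    Assumes both lists are non-empty; an empty whitelist would produce
    a lookahead that rejects every word.
    """
    black = "|".join(re.escape(word) for word in blacklist)
    white = "|".join(re.escape(word) for word in whitelist)
    return re.compile(
        rf"\b(?!(?:{white})\b)\w*(?:{black})\w*\b",
        re.IGNORECASE,  # assumption: the lists are upper case, the input is not
    )

PATTERN = build_pattern(["BUNNY", "GARDEN", "HOLE"], ["WHOLE", "GARDENER"])

def find_flagged(text):
    """Return every word that contains a blacklisted part and is not whitelisted."""
    # No capturing groups in the pattern, so findall returns whole matches.
    return PATTERN.findall(text)

print(find_flagged("This whole hole is a wholey mistake in the gardener agardener."))
# ['hole', 'wholey', 'agardener']
```

re.escape keeps list entries literal, so an entry containing regex metacharacters would not change the pattern's structure.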

    If you need to match letter-only words, replace \w with [a-zA-Z], or with \p{L} or [[:alpha:]] where the regex flavor supports them.