
How to add exceptions for bad words in a regular expression for profanity?


I use a Python regular expression for a profanity check. I have a list of blocked words, but there are some corner cases where I want to make exceptions.

For example, the blocked list is ['foo', 'bar'], but I want to exempt these cases:

  1. "BAR"
  2. "foo good"

This is my current approach in Python:

import re

profanity_list = ['foo', 'bar']
# Equivalent to r'\b(foo|bar)\b', matched case-insensitively
pattern_profanity = re.compile(r'\b({})\b'.format('|'.join(profanity_list)),
                               flags=re.IGNORECASE)
s = 'foo BAR foo good Bar'
censor_char = '*'
print(pattern_profanity.sub(repl=lambda m: censor_char*len(m.group(0)), string=s))

This gives me "*** *** *** good ***", but I want the result to be "*** BAR foo good ***". What should I do to handle the exceptional cases? Is this feasible with a regular expression? Thanks.

BTW, the solution I found is from this post.


Solution

  • You need a negative lookahead that skips the whitelisted phrases:

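    import re

    profanity_list = ['foo', 'bar']
    whitelist = ["BAR", "foo good"]  # exceptions, matched case-sensitively
    # The negative lookahead rejects positions where a whitelisted phrase starts;
    # (?i:...) makes only the blocked words case-insensitive (Python 3.6+).
    pattern_profanity = re.compile(
        r'\b(?!(?:{})\b)(?i:{})\b'.format('|'.join(whitelist), '|'.join(profanity_list)))
    s = 'foo BAR foo good Bar'
    censor_char = '*'
    print(pattern_profanity.sub(lambda m: censor_char * len(m.group(0)), s))
    # => *** BAR foo good ***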
    


    The pattern is \b(?!(?:BAR|foo good)\b)(?i:foo|bar)\b. It matches:

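    • \b - a word boundary
    • (?!(?:BAR|foo good)\b) - a negative lookahead that fails the match if BAR or foo good, followed by a word boundary, starts at the current position (this check is case-sensitive)
    • (?i:foo|bar) - an inline case-insensitive group that matches foo or bar in any letter case
    • \b - a word boundary, so only whole words are matched.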
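
If the lists may come from user input or contain regex metacharacters, the same pattern can be built a bit more defensively. The following is only a sketch under a few assumptions: the helper names build_censor_pattern and censor are made up for illustration, entries are escaped with re.escape, and an empty whitelist is handled separately, because an empty lookahead such as (?!(?:)\b) would reject every word and nothing would be censored.

```python
import re


def build_censor_pattern(blocked, allowed):
    """Compile a pattern for blocked words, skipping positions where an allowed phrase starts."""
    # Escape each entry so characters such as '.' or '+' are matched literally.
    # Longer entries go first so a longer alternative is tried before its prefix.
    blocked_alt = '|'.join(
        re.escape(w) for w in sorted(blocked, key=len, reverse=True))
    if not allowed:
        # No exceptions: leave the lookahead out entirely.
        return re.compile(r'\b(?i:{})\b'.format(blocked_alt))
    allowed_alt = '|'.join(
        re.escape(p) for p in sorted(allowed, key=len, reverse=True))
    # Allowed phrases are case-sensitive; blocked words are case-insensitive.
    return re.compile(
        r'\b(?!(?:{})\b)(?i:{})\b'.format(allowed_alt, blocked_alt))


def censor(text, pattern, censor_char='*'):
    """Replace every match with censor_char repeated to the match length."""
    return pattern.sub(lambda m: censor_char * len(m.group(0)), text)


pattern = build_censor_pattern(['foo', 'bar'], ['BAR', 'foo good'])
print(censor('foo BAR foo good Bar', pattern))  # *** BAR foo good ***
print(censor('Foo good, bar none', pattern))    # *** good, *** none
```

The second call shows that the exceptions are exact-case only: "Foo good" is still censored because only "foo good" is whitelisted. If the exceptions should ignore case as well, the lookahead group would need its own (?i:...) wrapper.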