regex, profanity

Match star * character at end of word boundary \b


While building a lightweight tool that detects censored profanity, I noticed that matching special characters at the end of a word, right before a word boundary, is quite difficult.

Using a tuple of strings, I build an OR'd regular expression wrapped in word boundaries:

import re

PHRASES = (
    r'sh\*t',  # easy: ends in a word character
    r'sh\*\*',  # difficult: ends in *
    r'f\*\*k',  # easy: ends in a word character
    r'f\*\*\*',  # difficult: ends in *
)

MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES), 
    flags=re.IGNORECASE | re.UNICODE)

The problem is that a phrase ending in * is never matched when the closing word boundary \b is followed by whitespace, punctuation, or the end of the string.

print(MATCHER.search('Well f*** you!'))  # Fail - Does not find f***
print(MATCHER.search('Well f***!'))  # Fail - Does not find f***
print(MATCHER.search('f***'))  # Fail - Does not find f***
print(MATCHER.search('f*** this!'))  # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***'))  # Pass - Should not match
print(MATCHER.search('f**k this!'))  # Pass - Finds f**k

Any ideas for setting this up in a convenient way to support phrases that end in special characters?


Solution

  • The * is not a word character, so there is no match when it is followed by \b and then a non-word character: \b only matches at a position that has a word character on exactly one side.

    Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary with a negative lookahead?

    \b(...)(?![\w*])
    

    See this demo at regex101

    If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
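
    A minimal sketch of how this could be wired into the matcher from the question. It uses both the lookbehind and the lookahead, and it assumes the phrases are stored unescaped and passed through re.escape, which is a convenience not present in the original tuple. The sample strings are the ones from the question plus f***a.

```python
import re

# Plain phrases (assumption: stored unescaped); re.escape() makes each '*' literal.
PHRASES = ('sh*t', 'sh**', 'f**k', 'f***')

# Simulated word boundaries: no word character or '*' directly
# before or after the matched phrase.
MATCHER = re.compile(
    r"(?<![\w*])(?:%s)(?![\w*])" % "|".join(re.escape(p) for p in PHRASES),
    flags=re.IGNORECASE,
)

SAMPLES = (
    'Well f*** you!',
    'Well f***!',
    'f***',
    'f*** this!',
    'secret code is 123f***',  # should not match: preceded by a word character
    'f**k this!',
    'f***a',                   # should not match: followed by a word character
)

for text in SAMPLES:
    match = MATCHER.search(text)
    print(repr(text), '->', match.group(0) if match else None)
```

    The character class [\w*] can be extended with any other censoring characters (for example # or @) that should count as part of a word.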