In building a lightweight tool that detects censored profanity, I noticed that detecting special characters at the end of a word, right next to a word boundary, is quite difficult.
Using a tuple of strings, I build an OR'd word-boundary regular expression:
import re
PHRASES = (
    'sh\\*t',      # easy
    'sh\\*\\*',    # difficult
    'f\\*\\*k',    # easy
    'f\\*\\*\\*',  # difficult
)
MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES),
    flags=re.IGNORECASE | re.UNICODE)
The problem is that the * is not something that can be detected next to a word boundary \b.
print(MATCHER.search('Well f*** you!')) # Fail - Does not find f***
print(MATCHER.search('Well f***!')) # Fail - Does not find f***
print(MATCHER.search('f***')) # Fail - Does not find f***
print(MATCHER.search('f*** this!')) # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***')) # Pass - Should not match
print(MATCHER.search('f**k this!')) # Pass - Should find
Any ideas for setting this up in a convenient way to support phrases that end in special characters?
The * is not a word character, so there is no match when it is followed by \b and then another non-word character.
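A quick way to see this in isolation (a minimal check, reusing two of the patterns from the question):
import re
# \b between '*' and '!' needs a word character on one side;
# both are non-word characters, so the boundary never matches.
print(re.search(r'f\*\*\*\b', 'f***!'))  # None
# With a trailing word character ('k') the boundary does match.
print(re.search(r'f\*\*k\b', 'f**k!'))   # <re.Match object; span=(0, 4), match='f**k'>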
Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about simulating your own word boundary by use of a negative lookahead:
\b(...)(?![\w*])
If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
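Putting it together with the PHRASES tuple from the question, a sketch of the adjusted matcher might look like this (only the boundaries change; everything else stays as in the original):
import re
PHRASES = (
    'sh\\*t',
    'sh\\*\\*',
    'f\\*\\*k',
    'f\\*\\*\\*',
)
# Simulated word boundaries: no word character or '*' may directly
# precede or follow the matched phrase.
MATCHER = re.compile(
    r"(?<![\w*])(%s)(?![\w*])" % "|".join(PHRASES),
    flags=re.IGNORECASE | re.UNICODE)

print(MATCHER.search('Well f*** you!'))          # now finds f***
print(MATCHER.search('f***'))                    # now finds f***
print(MATCHER.search('secret code is 123f***'))  # still no match
print(MATCHER.search('f**k this!'))              # still finds f**k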