I want to match all cases where a hyphenated string (which could be made up of one or multiple hyphenated segments) ends in a consonant that is not the letter m.
In other words, it needs to match strings such as 'crack-l', 'crac-ken', 'cr-ca-cr-cr' etc., but not 'crack' (not hyphenated), 'br-oom' (ends in m), 'br-oo' (last segment ends in a vowel) or 'cr-ca-cr-ca' (last segment ends in a vowel).
My pattern is mostly successful, except when there is more than one hyphen: it then returns part of the string, such as 'cr-ca-cr', taken from 'cr-ca-cr-ca', which should not match at all.
Here is the code I have tried with example data:
import re
dummy_data = """
broom
br-oom
br-oo
crack
crack-l
crac-ken
crack-ed
cr-ca-cr-ca
cr-ca-cr-cr
cr-ca-cr-cr-cr
"""
pattern = r'\b(?:\w+-)+\w*[bcdfghjklnpqrstvwxyz](?<!m)\b'
final_consonant_hyphenated = [
m.group(0)
for m in re.finditer(pattern, dummy_data, flags=re.IGNORECASE)
]
print(final_consonant_hyphenated)
expected output:
['crack-l', 'crac-ken', 'crack-ed', 'cr-ca-cr-cr', 'cr-ca-cr-cr-cr']
current output:
['crack-l', 'crac-ken', 'crack-ed', **'cr-ca-cr'**, 'cr-ca-cr-cr', 'cr-ca-cr-cr-cr']
(bold string is an incorrect match as it's part of the cr-ca-cr-ca
string where the final segment ends in a vowel not a consonant).
You could add a negative lookahead, (?!-), to prevent the match from being followed by a hyphen. Another idea is to shorten [bcdfghjklnpqrstvwxyz](?<!m) to [a-z](?<![aeioum]).
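To illustrate why the partial match happens and what the lookahead changes, here is a minimal sketch that runs the question's pattern next to a variant with the shortened class and (?!-) appended. The names `original`, `fixed` and `find_all` are only illustrative; the data is the question's own.

```python
import re

dummy_data = """
broom
br-oom
br-oo
crack
crack-l
crac-ken
crack-ed
cr-ca-cr-ca
cr-ca-cr-cr
cr-ca-cr-cr-cr
"""

# Pattern from the question. \b is also satisfied between 'r' and '-',
# so a match may end in the middle of 'cr-ca-cr-ca'.
original = r'\b(?:\w+-)+\w*[bcdfghjklnpqrstvwxyz](?<!m)\b'

# Shortened character class plus a negative lookahead that rejects
# any match immediately followed by another hyphen.
fixed = r'\b(?:\w+-)+\w*[a-z](?<![aeioum])\b(?!-)'


def find_all(pattern, text):
    """Return every full match of pattern in text, ignoring case."""
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


before = find_all(original, dummy_data)
after = find_all(fixed, dummy_data)

print(before)  # still contains the partial match 'cr-ca-cr'
print(after)   # partial match is gone
```

The only behavioural difference between the two patterns on this data is the lookahead: the character-class change is a shorter way of writing the same set of final letters.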
Update: As @Thefourthbird mentioned in the comments, putting the lookbehind after the word boundary \b also gives better performance (fewer steps):
\b(?:\w+-)+\w*[a-z]\b(?<![aeioum])(?!-)
See this demo at regex101. It can be shortened even further:
\b(?:\w+-)+\w+\b(?<![aeioum\d_])(?!-)
This drops the [a-z], uses \w+ instead of \w*, and also disallows digits and underscore from \w in the lookbehind. With a possessive quantifier (using the PyPI regex module) it is reduced further:
\b(?:\w+-)+\w++(?<![aeioum\d_])(?!-)
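As a quick sanity check, the variants can be compared on a small sample. This is only a sketch: the sample text and the `variants` dictionary are made up for illustration. The possessive form is tried with the standard `re` module only on Python 3.11 or later, where possessive quantifiers are supported; on older versions the PyPI regex module would be needed instead.

```python
import re
import sys

# Illustrative sample, including a trailing digit and underscore.
text = "br-oom br-oo crack crack-l cr-ca-cr-ca cr-ca-cr-cr ab-c1 ab-c_"

variants = {
    "class + lookbehind": r'\b(?:\w+-)+\w*[a-z]\b(?<![aeioum])(?!-)',
    "lookbehind only": r'\b(?:\w+-)+\w+\b(?<![aeioum\d_])(?!-)',
}

# Possessive quantifiers in the stdlib re module need Python 3.11+.
if sys.version_info >= (3, 11):
    variants["possessive"] = r'\b(?:\w+-)+\w++(?<![aeioum\d_])(?!-)'

# All groups are non-capturing, so findall returns the whole matches.
results = {
    name: re.findall(pattern, text, flags=re.IGNORECASE)
    for name, pattern in variants.items()
}

for name, matches in results.items():
    print(f"{name}: {matches}")
```

On this sample every variant should return the same two strings, 'crack-l' and 'cr-ca-cr-cr'. This compares results only; step counts are best checked in a regex debugger such as regex101.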