I am trying to extract all instances of the VCV (vowel-consonant-vowel) pattern in a word using regex. This should also include the start and end of the word, which could be CV at the start or VC at the end.
Given the word "bookeeping" as an input, the expected output would be:
boo, ookee, eepi, ing
My current attempt using the regex library for overlapping patterns looks like:
import regex as re
word = "bookeeping"
print(re.findall(r'[aeiouy]+?[bcdfghkjlmnpqrstvwxz]+[aeiouy]+', word, overlapped=True))
with the (incorrect) output:
['ookee', 'okee', 'eepi', 'epi']
'okee' is not valid, and the pattern does not grab the start or end of the word. How do I force it to exclude matches that do not include all preceding matches?
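Judging by your expected output, the vowels on either side of the consonants are optional in the pattern you're looking for (that is how the leading "boo" and the trailing "ing" qualify). In that case you can search repeatedly with the standard re module, restarting each search at the trailing vowel group of the previous match so that it becomes the leading vowel group of the next one: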
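import re

word = 'bookeeping'
pos = 0
while True:
    # Optional vowels, one or more consonants, then optional vowels (group 1).
    match = re.search(r'[aeiouy]*[bcdfghkjlmnpqrstvwxz]+([aeiouy]*)', word[pos:])
    if not match:
        break
    print(match.group(0))
    # Resume at the trailing vowels so they also start the next match.
    pos += match.start(1)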
This outputs:
boo
ookee
eepi
ing
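If you need the chunks as a list rather than printed output, the same loop can be wrapped in a function. The following is only a sketch of that idea: the names `vcv_chunks`, `VOWELS`, and `CONSONANTS` are made up for illustration, and it assumes lowercase input and the same letter classes as above. It uses the `pos` argument of a compiled pattern's `search` method instead of slicing, so `match.start(1)` is already an index into the full word.

```python
import re

# Hypothetical names; the letter classes mirror the ones used above.
VOWELS = "aeiouy"
CONSONANTS = "bcdfghjklmnpqrstvwxz"
_PATTERN = re.compile(rf"[{VOWELS}]*[{CONSONANTS}]+([{VOWELS}]*)")


def vcv_chunks(word):
    """Return overlapping (V)C(V) chunks of a lowercase word."""
    chunks = []
    pos = 0
    while True:
        # Search from pos without slicing, so indices stay absolute.
        match = _PATTERN.search(word, pos)
        if match is None:
            break
        chunks.append(match.group(0))
        # The trailing vowel group becomes the start of the next search.
        # It always begins after at least one consonant, so pos advances.
        pos = match.start(1)
    return chunks


print(vcv_chunks("bookeeping"))
```

For "bookeeping" this should print ['boo', 'ookee', 'eepi', 'ing'], matching the output above.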