python, regex-greedy

Finding specific overlapping pattern using Python


I am trying to extract all instances of the VCV (vowel-consonant-vowel) pattern in a word using regex. The result should also include the chunks at the start and end of the word, which may be CV at the start or VC at the end.

Given the word "bookeeping" as an input, the expected output would be:

boo, ookee, eepi, ing

My current attempt using the regex library for overlapping patterns looks like:

import regex as re

word = "bookeeping"
print(re.findall(r'[aeiouy]+?[bcdfghkjlmnpqrstvwxz]+[aeiouy]+', word, overlapped=True))

with the (incorrect) output:

['ookee', 'okee', 'eepi', 'epi']

'okee' is not valid because it starts partway through the vowel run "oo", and the result does not include the chunks at the start or end of the word. How do I force it to skip matches that do not include all of the preceding vowels?


Solution

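  • It seems from your expected output that the leading and trailing vowel groups are optional in the vowel-consonant-vowel pattern you're looking for (so that "boo" and "ing" qualify). In that case you can search repeatedly, restarting each search at the trailing vowel group of the previous match so that those vowels also begin the next match: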

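    import re

    word = 'bookeeping'
    # Optional leading vowels, one or more consonants, optional trailing vowels (group 1).
    pattern = re.compile(r'[aeiouy]*[bcdfghkjlmnpqrstvwxz]+([aeiouy]*)')
    pos = 0
    while True:
        match = pattern.search(word, pos)
        if not match:
            break
        print(match.group(0))
        # Resume at the trailing vowels so they also lead the next match.
        pos = match.start(1)
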

    This outputs:

    boo
    ookee
    eepi
    ing
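
As a possible alternative, the same chunks can be collected with a single pattern and the standard re module, by capturing inside a lookahead so that matches may overlap. This is only a sketch and not part of the answer above: the names VOWELS, CONSONANTS, CHUNK and vcv_chunks are made up for illustration, and it assumes lowercase ASCII input with "y" treated as a vowel, as in the question. A chunk is allowed to start only at the beginning of the word or where a vowel run begins right after a consonant, which is what rules out partial matches such as 'okee'.

```python
import re

# Assumption: lowercase ASCII letters only, with "y" counted as a vowel.
VOWELS = "aeiouy"
CONSONANTS = "bcdfghjklmnpqrstvwxz"

CHUNK = re.compile(
    # Start positions: the beginning of the word, or the first vowel of a
    # vowel run (previous character is a consonant, next one is a vowel).
    rf"(?:^|(?<=[{CONSONANTS}])(?=[{VOWELS}]))"
    # The lookahead consumes nothing, so captured chunks are free to overlap.
    rf"(?=([{VOWELS}]*[{CONSONANTS}]+[{VOWELS}]*))"
)


def vcv_chunks(word):
    """Return the overlapping (V)C(V) chunks of word, in order."""
    return [m.group(1) for m in CHUNK.finditer(word)]


print(vcv_chunks("bookeeping"))
```

For "bookeeping" this should print the same four chunks as the loop above. Whether the single pattern is easier to read than the explicit loop is a matter of taste; the loop makes the restart position visible, while the pattern keeps everything in one expression.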