python python-re findall

How do I write a regular expression to find all words which have 2 or more of the same consonant in a row?


I'm trying to write a regular expression to find all words which contain a sequence of 2 or more of the same consonant.

I have tried the following, but it does not work:

    import re

    xh_data = "mmh tshhu itshu mama krrrr"
    onomat_consonant_words = re.findall(r'\b\w*([b-df-hj-np-tv-z])\1\w*\b', xh_data, flags=re.IGNORECASE)

    print(onomat_consonant_words)

It should give the following output: ['mmh', 'tshhu', 'krrrr'], but it currently just gives ['m', 'h', 'r'].

I am trying to use a backreference with \1, but I am not sure I am doing it correctly here.


Solution

  • There are two issues here:

    • { should not be in your regex. It looks for a literal opening brace... (you removed it after I made the comment).

    • As the documentation says about findall, it will not return the full matches when you have capture groups in your regex, but only what is captured by those groups.
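The findall behaviour described above can be reproduced with a minimal sketch. The shortened input string and the variable names are illustrative assumptions, not part of the original question; the pattern is the one from the question without the trailing \b.

```python
import re

text = "mmh tshhu"
pattern = r'\b\w*([b-df-hj-np-tv-z])\1\w*'

# With one capture group in the pattern, findall returns only what the
# group captured: the doubled consonant, not the surrounding word.
captured = re.findall(pattern, text, flags=re.IGNORECASE)
print(captured)  # ['m', 'h']

# The full match still exists; a match object exposes both pieces.
first = re.search(pattern, text, flags=re.IGNORECASE)
print(first.group(0), first.group(1))  # mmh m
```

So the regex itself matches the whole word; it is only findall's return value that hides it.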

    One solution is to use finditer and extract the complete match:

    onomat_consonant_words = [
        m[0]
        for m in re.finditer(r'\b\w*([b-df-hj-np-tv-z])\1\w*', xh_data, flags=re.IGNORECASE)
    ]
    

    Note that you don't really need the final \b. It is implied by the greedy \w*, which consumes every remaining word character, so the match always ends at a word boundary anyway.
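For completeness, here is a self-contained sketch of the finditer approach applied to the sample data from the question. The helper name doubled_consonant_words and the two compiled pattern names are hypothetical, introduced only for illustration; the second pattern keeps the trailing \b so the two variants can be compared on this input.

```python
import re

# Pattern from the answer (no trailing \b) and the variant that keeps it.
DOUBLED = re.compile(r'\b\w*([b-df-hj-np-tv-z])\1\w*', re.IGNORECASE)
DOUBLED_WITH_B = re.compile(r'\b\w*([b-df-hj-np-tv-z])\1\w*\b', re.IGNORECASE)


def doubled_consonant_words(text, pattern=DOUBLED):
    """Return every word containing the same consonant twice in a row."""
    # m[0] is the complete match, regardless of any capture groups.
    return [m[0] for m in pattern.finditer(text)]


xh_data = "mmh tshhu itshu mama krrrr"
without_b = doubled_consonant_words(xh_data)
with_b = doubled_consonant_words(xh_data, DOUBLED_WITH_B)

print(without_b)            # ['mmh', 'tshhu', 'krrrr']
print(without_b == with_b)  # True
```

Note that \w also matches digits and underscores, so "words" here means runs of word characters; that is an inherited property of the original pattern rather than a deliberate choice in this sketch.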