python regex python-re

re.findall()'s behavior for patterns with a single capturing group followed by a quantifier


re.findall() does not seem to return the actual matches if the pattern has a single quantified capturing group.

For instance:

import re

p1 = r"(apple)*"
t1 = "appleappleapple"

re.findall(p1, t1)  # returns ['apple', '']

Whereas using the same arguments in

[i.group() for i in re.finditer(p1, t1)]

yields the full matches, which are ['appleappleapple', '']

Another thing that's puzzling me is this behavior:

t2 = "appleapplebananaapplebanana"

re.findall(p1, t2)  # returns ['apple', '', '', '', '', '', '', 'apple', '', '', '', '', '', '', '']

Where exactly are those extra empty strings coming from? Why does findall() capture them before the end of the input string?


Solution

  • I believe that @Deepak's answer does not fully address the question.

    Let's examine the first code snippet:

    p1 = r"(apple)*"
    t1 = "appleappleapple"
    
    re.findall(p1, t1)  # returns ['apple', '']
    

    Let's clarify our expectations. I had anticipated the output of the above snippet to be ['appleappleapple', '']. Because * is greedy, the first match should run to the end of the string, and since findall returns only non-overlapping matches, the only other match should be the empty string at the end.

    However, why is the output different?

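    As mentioned in the docs, if the pattern contains exactly one capturing group, findall returns the text matched by that group rather than the whole match (with several groups, it returns tuples). A quantified group keeps only its last repetition, which is why you obtain apple as the match and not appleappleapple. The final empty match occurs without the group participating at all, and findall reports that as ''. To get the whole matches from findall, make the group non-capturing: (?:apple)*.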
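
As a small check of the point above, the following sketch compares the capturing pattern, a non-capturing variant, and finditer on the same input. The variable names group_only, whole_matches, and pairs are only illustrative, and the outputs in the comments assume Python 3.7 or later.

```python
import re

p1 = r"(apple)*"
t1 = "appleappleapple"

# With exactly one capturing group, findall() returns that group's text,
# and a repeated group keeps only its last repetition.
group_only = re.findall(p1, t1)

# A non-capturing group (?:...) leaves the pattern without groups,
# so findall() returns the whole matches instead.
whole_matches = re.findall(r"(?:apple)*", t1)

# finditer() exposes both views: group(0) is the whole match,
# group(1) is the capturing group (None if it did not participate).
pairs = [(m.group(0), m.group(1)) for m in re.finditer(p1, t1)]

print(group_only)     # ['apple', '']
print(whole_matches)  # ['appleappleapple', '']
print(pairs)          # [('appleappleapple', 'apple'), ('', None)]
```

Note that findall turns the None of a non-participating group into an empty string, which is where the trailing '' comes from.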

    Now, concerning the third snippet, I believe Deepak's answer does address it. Nevertheless, for the sake of completeness, I'll mention it here too:

    t2 = "appleapplebananaapplebanana"
    
    re.findall(p1, t2)  # returns ['apple', '', '', '', '', '', '', 'apple', '', '', '', '', '', '', '']
    

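    Since you used *, the pattern matches 0 or more occurrences of the group, so it also matches the empty string at every position where apple does not start. In t2 that happens at each of the six characters of the first banana, at each of the six characters of the second banana, and once more at the end of the string. This is why you obtain six empty strings after the first 'apple' and seven after the second.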
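
To see where each empty string comes from, one option is to list the match positions with finditer. This is only a sketch: the names spans, nonempty, and empty_positions are made up for illustration, and the positions in the comments assume Python 3.7 or later.

```python
import re

p1 = r"(apple)*"
t2 = "appleapplebananaapplebanana"

# (start, end) offsets of every match that finditer() reports
spans = [m.span() for m in re.finditer(p1, t2)]

# Matches that actually consumed text
nonempty = [s for s in spans if s[0] != s[1]]

# Offsets at which the pattern matched the empty string
empty_positions = [s[0] for s in spans if s[0] == s[1]]

print(nonempty)         # [(0, 10), (16, 21)]
print(empty_positions)  # [10, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25, 26, 27]
```

Offsets 10 to 15 and 21 to 26 are the characters of the two banana substrings, and 27 is the end of the string.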