Tags: python, regex, non-greedy

Why is the minimal (non-greedy) match affected by the end of string character '$'?


EDIT: Removed the original example because it provoked ancillary answers. Also fixed the title.

The question is why the presence of "$" in the regular expression affects the greediness of the expression.

Here is a simpler example:

>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'

The "?" seems to be doing nothing. Note that when the "$" is removed, however, the "?" is respected:

>>> m = re.search(r"a+?", str)
>>> m.group()
'a'

EDIT: In other words, "a+?$" is matching ALL of the a's instead of just the last one, which is not what I expected. Here is the description of the regex "+?" from the Python docs: "Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."

This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaaa" just a single "a" (the rightmost one)?


Solution

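  • Candidate matches are ranked "left-most, then longest". "Longest" is a term from before non-greedy quantifiers existed; here it really means something like "preferred number of repetitions for each atom" (as many as possible for a greedy quantifier, as few as possible for a non-greedy one). Being left-most takes priority over the number of repetitions. Thus "a+?$" will not match only the last "a" in "baaaaa": a match that begins at the first "a" starts earlier in the string, and from that starting point "a+?" has to keep consuming characters until "$" can succeed.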

    (Answer changed after OP clarification in comments. See history for previous text.)
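
To make the "left-most first" rule concrete, here is a small sketch using Python's standard `re` module. The names `text`, `pattern`, `first`, `starts`, and `last` are only illustrative and do not come from the question. It uses the optional `pos` argument of a compiled pattern's `match()` to ask, for each index, whether a match can begin exactly there.

```python
import re

text = "baaaaaaaa"
pattern = re.compile(r"a+?$")

# search() tries start positions from left to right and returns the first
# position at which the whole pattern (including "$") can succeed.
first = pattern.search(text)
print(first.span(), first.group())

# All start positions at which the pattern could match if we forced the
# attempt to begin there. match(text, i) only tries a match starting at i.
starts = [i for i in range(len(text)) if pattern.match(text, i)]
print(starts)

# Forcing the attempt to begin at the last index yields the single "a"
# the question expected.
last = pattern.match(text, len(text) - 1)
print(last.group())
```

The non-greedy quantifier only decides how far the match extends from a given starting point, and with "$" right after it there is only one possible end. If just the final "a" is wanted, the simpler pattern `a$` gives it directly.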