Search code examples
pythonregexpypi-regex

How does adding some text make this regex match the input even though there's no lookahead?


While working on an answer to this question, I came up with this regex:

(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2

(Note: this regex requires the PyPI regex module)

(Short explanation: The regex relies on the fact that capture groups in lookaheads can't change their value after they've matched once, so after the first foo= is found, the (?=()) matches and from that point onwards (?!\2) will always fail.)

This regex works correctly with the 2 examples given in the question:

>>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>>

But something strange happens if there's an occurrence of foo= after a bar=2:

>>> # this doesn't match, as expected:
>>> regex.match(pattern, 'notfoo=1,bar=2')
>>> # but how the heck does it match this ?!
>>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
<regex.Match object; span=(0, 14), match='notfoo=1,bar=2'>

As you can see, the string 'notfoo=1,bar=2,foo=3,' produced a match of notfoo=1,bar=2. The foo=3, isn't even included in the match, but if it's removed, the regex no longer matches! How is this possible? Is this a bug in the regex module?


Solution

  • This actually makes perfect sense. The reason for this behavior is simple: backtracking.

    The sequence of events is as follows:

    1. The greedy group (?:...)* advances one character at a time until it finally finds an occurrence of foo= at ,foo=3,
    2. The regex attempts to match bar=2, but this fails
    3. The regex backtracks 1 character at a time until bar=2 matches, giving us a result of notfoo=1,bar=2.

    So, what can we do about this? We can move the bar=2 into the greedy group and use another capture group to assert that the regex matched successfully:

    (?:(?!\2)(?:,foo=([^,]*),(?=())bar=2()|.))*\3
    
    >>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())bar=2()|.))*\3'
    >>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
    '1'
    >>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
    >>> regex.match(pattern, 'notfoo=1,bar=2')
    >>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
    >>>