While working on an answer to this question, I came up with this regex:
(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2
(Note: this regex requires the PyPI regex
module)
(Short explanation: The regex relies on the fact that capture groups in lookaheads can't change their value after they've matched once, so after the first foo=
is found, the (?=())
matches and from that point onwards (?!\2)
will always fail.)
This regex works correctly with the 2 examples given in the question:
>>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>>
But something strange happens if there's an occurrence of foo=
after a bar=2
:
>>> # this doesn't match, as expected:
>>> regex.match(pattern, 'notfoo=1,bar=2')
>>> # but how the heck does it match this ?!
>>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
<regex.Match object; span=(0, 14), match='notfoo=1,bar=2'>
As you can see, the string 'notfoo=1,bar=2,foo=3,'
produced a match of notfoo=1,bar=2
. The foo=3,
isn't even included in the match, but if it's removed, the regex no longer matches! How is this possible? Is this a bug in the regex
module?
This actually makes perfect sense. The reason for this behavior is simple: backtracking.
The sequence of events is as follows:
(?:...)*
advances one character at a time until it finally finds an occurrence of foo=
at ,foo=3,
bar=2
, but this failsbar=2
matches, giving us a result of notfoo=1,bar=2
.So, what can we do about this? We can move the bar=2
into the greedy group and use another capture group to assert that the regex matched successfully:
(?:(?!\2)(?:,foo=([^,]*),(?=())bar=2()|.))*\3
>>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())bar=2()|.))*\3'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>> regex.match(pattern, 'notfoo=1,bar=2')
>>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
>>>