Search code examples
pythonregexnfa

Python regular expression


I have an expression like:

^(?P<stereo1>/?|\\\\?)(?P<bond1>|=?|\.|#?)(?P<number1>[0-9%]*)(?P<branching>[()]*)(?P<stereo2>/?|\\\\?)(?P<bond2>|=?|\.|#?)(?P<number2>[0-9%]*)$

And suppose we have a string '\1' After:

re.match(regexp, string)

stereo2 = '\' and number2=1.

My question is: why stereo1 != '\' and 'number1' != '1'?

Also when we have string '/1'

re.match(regexp,string)

Output: stereo1 = '/', number1 = '1'


Solution

  • When a pattern contains an alternation, the regex engine tries to find a match with each branches from the leftmost to the last branch. It is the default behaviour of NFA engines. So, if there is a match with the leftmost branch other branches are not tested.

    What's happening in your particular case?

    (?P<stereo1>/?|\\\\?) succeeds with its first branch /? and matches the empty string (since the slash is optional), and the second branch is never tested.

    When (?P<stereo2>/?|\\\\?) is reached, the same scenario happens, but when the regex engine arrived to the end anchor $, the pattern fails. Then the regex engine backtracks until (?P<stereo2>/?|\\\\?) and tests the second branch that succeeds.

    Note: a DFA regex engine has a different behaviour, it tests each branches and keeps the branch that has the larger result.

    So if you want to capture the backslash with the stereo1 group, you only need to permute the branches: (?P<stereo1>\\\\?|/?)