Search code examples
pythonpython-3.xpython-re

Why does the order of expressions matter in re.match?


I'm making a function that will take a string like "three()" or something like "{1 + 2}" and put them into a list of token (EX: "three()" = ["three", "(", ")"] I using the re.match to help separate the string.

def lex(s):
# scan input string and return a list of its tokens
seq = []
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
m = re.match(patterns,s)
while m != None:
    if s == '':
        break
    seq.append(m.group(2))
    s = s[len(m.group(0)):]
    m = re.match(patterns,s)
return seq

This one works if the string is just "three". But if the string contains "()" or any symbol it stays in the while loop. But a funny thing happens when move ([a-z])* in the pattern string it works. Why is that happening?

works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

Solution

  • This one is a bit tricky, but the problem is with this part ([a-z])*. This matches any string of lowercase letters size 0 (zero) or more.

    If you put this sequence at the end, like here:

    patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
    

    The regex engine will try the other matches first, and if it finds a match, stop there. Only if none of the others match, does it try ([a-z])* and since * is 'greedy', it will match all of three, then proceed to match ( and finally ).

    Read an explanation of how the full expression is tested in the documentation (thanks to @kaya3).

    However, if you put that sequence a the start, like here:

    patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
    

    It will now try to match it first. It's still greedy, so three still gets matched. But then on the next try, it will try to match ([a-z])* to the remaining '()' - and it matches, since that string starts with zero letters.

    It keeps matching it like that, and gets stuck in the loop. You can fix it by changing the * for a + which will only match if there is 1 or more matches:

    patterns = (r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")