Search code examples
cregexbisonflex-lexer

Regular expression unexpected pattern matching


I am trying to create a syntax parser using C-Bison and Flex. In Flex I have a regular expression which matches integers based on the following:

  1. Must start with any digit in range 1-9 and followed by any number of digits in range 0-9. (ex. Correct: 1,12,11024 | Incorrect: 012)

  2. Can be signed (ex. +2,-5)

  3. The number 0 must not be followed by any digit (0-9) and must not signed. (ex. Correct: 0 | Incorrect: 012,+0,-0)

Here is the regex I have created to perform the matching: [^+-]0[^0-9]|[+-]?[1-9][0-9]*

Here is the expression I am testing: (1 + 1 + 10)

The matches:

1
1
10)

And here is my question, why does it match '10)'?

The reason I used the above expression, instead of the much simpler one, (0|[+-]?[1-9][0-9]*) is due to inability of the parser to recognise incorrect expressions such as 012.

The problem seems to occur only when before the ')' precedes the digit '0'. However if the '0' is preceded by two or more digits (ex. 100), then the ')' is not matched.

I know for a fact if I remove [^0-9] from the regex it doesn't match the ')'.


Solution

  • It matches 10( because 1 matches [^+-], 0 matches 0 and ( matches [^0-9].

    The reason I used the above expression, instead of the much simpler one, (0|[+-]?[1-9][0-9]*) is due to inability of the parser to recognise incorrect expressions such as 012.

    How so? Using the above regex, 012 would be recognized as two tokens: 0 and 12. Would this not cause an error in your parser?

    Admittedly, this would not produce a very good error message, so a better approach might be to just use [0-9]+ as the regex and then use the action to check for a leading zero. That way 012 would be a single token and the lexer could produce an error or warning about the leading zero (I'm assuming here that you actually want to disallow leading zeros - not use them for octal literals).

    Instead of a check in the action, you could also keep your regex and then add another one for integers with a leading zero (like 0[0-9]+ { warn("Leading zero"); return INT; }), but I'd go with the check in the action since it's an easy check and it keeps the regex short and simple.

    PS: If you make - and + part of the integer token, something like 2+3 will be seen as the integer 2, followed by the integer +3, rather than the integers 2 and 3 with a + token in between. Therefore it is generally a better idea to not make the sign a part of the integer token and instead allow prefix + and - operators in the parser.