Flex-lexer how to match simple statements?


I am writing flex code to match simple C++ statements, such as:

a=b+c;
a=12;

etc.

What I have written is:

stat ^[a-zA-Z][a-zA-Z0-9]*+"="([a-zA-Z][a-zA-Z0-9]*|([0-9][^a-zA-Z])+)+(("+"|"-"|"*"|"/")([a-zA-Z][a-zA-Z0-9]*|([0-9][^a-zA-Z])+)+)*+";"$

It accepts the statements c=a+b*23; and a=2+32; but not a=2+3;.

The idea behind the pattern is: if a variable name starts with a letter (a-zA-Z), accept it; if it starts with a digit, reject it.

So ([a-zA-Z][a-zA-Z0-9]*|([0-9][^a-zA-Z])+) should match a word that starts with a letter and continues with letters or digits, but if the word starts with a digit, the next character should also be a digit (for statements like a=10;).


Solution

  • The idea behind a lexical scanner is that it identifies individual tokens (identifiers, literal constants, operators, punctuation, etc.), not complete syntactic constructs like statements.

    Trying to use regular expression patterns to recognise something as complex as an expression is almost bound to fail, even for expressions without parentheses. Those can in principle be recognised by a regular expression, but dealing with all the corner cases makes the pattern unnecessarily complicated. And once you add parentheses, the task becomes impossible (at least for flex's pattern language, which, unlike most regex libraries, really is regular).

    Instead, use the scanner to split the input into simple pieces (tokens) and discard ignorable sequences (whitespace and comments). The resulting tokens can then be analysed by a context-free parser.
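
As a rough illustration of that approach, a token-level flex specification might look like the sketch below. It is not a complete C++ lexer. The definition names ID and NUM, the printed labels, and the choice to skip only `//` line comments are assumptions made for the example. In a real project the actions would more likely return token codes to a parser (for instance one generated by bison) rather than print.

```lex
%{
/* Sketch only: labels and printed output are illustrative assumptions. */
#include <stdio.h>
%}

%option noyywrap

ID      [a-zA-Z][a-zA-Z0-9]*
NUM     [0-9]+

%%

{ID}        { printf("IDENTIFIER %s\n", yytext); }
{NUM}       { printf("NUMBER %s\n", yytext); }
"="         { printf("ASSIGN\n"); }
[-+*/]      { printf("OPERATOR %s\n", yytext); }
";"         { printf("SEMICOLON\n"); }
[ \t\r\n]+  { /* ignore whitespace */ }
"//".*      { /* ignore line comments */ }
.           { printf("UNEXPECTED %s\n", yytext); }

%%

int main(void)
{
    yylex();    /* scan standard input until end of file */
    return 0;
}
```

Assuming the file is saved as scanner.l, it could be built with `flex scanner.l` followed by `cc lex.yy.c -o scanner`. Because each rule matches one token, the input a=2+3; is split into a, =, 2, +, 3 and ; one at a time. This also avoids the problem in the original pattern, where ([0-9][^a-zA-Z])+ consumes a digit together with the character after it, so the + or ; following a single digit is swallowed before the operator or ";" part of the pattern can match it.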