c compiler-construction flex-lexer lex lexical-analysis

How to define numbers format in Flex (lexical analyzer)?

What I need :

Acceptable > 1234 & 12.34

Error (Non acceptable) > 12.34.56

Scanner.L :

      ...
%%

[0-9]+                printf("Number ");
[0-9]+"."[0-9]+       printf("Decimal_Number ");
"."                   printf("Dot ");

%%
      ...

After compile & run :

Input :
1234    12.34    12.34.65

Output :
Number    Decimal_Number      Decimal_Number Dot Number

How to print Error instead of Decimal_Number Dot Number (Or just ignore it) ?

~~Is it possible to define space before & after number as separator ?~~

Solution

It's often considered better to detect errors like 12.34.56 in your parser rather than your scanner. But there is also an argument that you can produce better error messages by detecting the error lexically.

If you want to do so, you can use two patterns; the first one detects only correct numbers and the second one detects a larger set of strings, including all the erroneous strings (but not anything which could be legitimate). This relies on the matching behaviour of (f)lex: it always accepts the longest match, and if the longest token is matched by two or more rules, it uses the first matching rule.

For example, suppose you wanted to accept dots by themselves as '.', numbers as INTEGER or FLOAT tokens, and produce an error on numeric strings with more than one dot. You could do that with four rules:

  /* If the token is just a dot, match it here */
\.                             { return '.';    }
  /* Match integers without decimal points */
[[:digit:]]+                   { return INTEGER; }
  /* If the token is a number including a decimal point,
   * match it here. This pattern will also match just '.',
   * but the first rule will be preferred. */
[[:digit:]]*\.[[:digit:]]*     { return FLOAT; }
  /* This rule matches any sequence of dots and digits.
   * That will also match single dots and correct numbers, but
   * again, the previous rules are preferred. */
[.[:digit:]]+                  { /* signal error */
                                 return BADNUMBER; }
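To see why the rule order matters, here is a rough model in plain C of how the four rules above compete. It is only a sketch of the longest-match, first-rule-wins behaviour, not what flex actually generates: the TOK_ names, the matcher functions and next_token are all made up for illustration, and isdigit stands in for [[:digit:]] (the same thing in the C locale).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; a real scanner would get these from the parser. */
enum token { TOK_NONE, TOK_DOT, TOK_INTEGER, TOK_FLOAT, TOK_BADNUMBER };

/* Length of the run of decimal digits at the start of s. */
static size_t digits(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Each matcher returns the length of the longest prefix of s that its
 * pattern matches, or 0 if the pattern does not match at all. */
static size_t match_dot(const char *s) {          /* \. */
    return s[0] == '.' ? 1 : 0;
}
static size_t match_integer(const char *s) {      /* [[:digit:]]+ */
    return digits(s);
}
static size_t match_float(const char *s) {        /* [[:digit:]]*\.[[:digit:]]* */
    size_t n = digits(s);
    if (s[n] != '.') return 0;
    n++;
    return n + digits(s + n);
}
static size_t match_badnumber(const char *s) {    /* [.[:digit:]]+ */
    size_t n = 0;
    while (s[n] == '.' || isdigit((unsigned char)s[n])) n++;
    return n;
}

struct rule { size_t (*match)(const char *); enum token tok; };

/* Same order as the rules in the scanner definition. */
static const struct rule rules[] = {
    { match_dot,       TOK_DOT       },
    { match_integer,   TOK_INTEGER   },
    { match_float,     TOK_FLOAT     },
    { match_badnumber, TOK_BADNUMBER },
};

/* Pick the longest match at s; on a tie the earlier rule is kept. */
enum token next_token(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    enum token best = TOK_NONE;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {          /* strictly longer, so ties go to the earlier rule */
            *len = n;
            best = rules[i].tok;
        }
    }
    return best;
}
```

With this model, 12.34 is matched at full length by both the FLOAT and the BADNUMBER matchers, and FLOAT wins because it comes first. For 12.34.56 only the BADNUMBER matcher covers all eight characters, so it wins on length.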

You need to be very careful with solutions like the above. For example, the last rule will match .. and ..., which might be valid tokens (or even valid sequences of . tokens.)

Suppose, for example, that your language permits "range" expressions like 4 .. 17 (meaning the list of integers from 4 to 17, or some such). Your users might expect 4..17 to be accepted as a range expression, but the above will produce a BADNUMBER error, even when you add the rule

".."                           { return RANGE; }

at the beginning, because the scan starts at the 4, and from that point the BADNUMBER rule matches all of 4..17, which is longer than anything the other rules can match there.

In order to avoid false alerts, we need to modify the BADNUMBER rule to avoid matching strings which include two (or more) consecutive dots. And we also need to make sure that 4..17 is not lexed as 4. followed by .17. (This second problem could be avoided by insisting that . neither start nor end a numeric token, but that might annoy some users.)

So, we start with the actual dot tokens:

"."                            { return '.'; }
".."                           { return RANGE; }
"..."                          { return ELLIPSIS; }

To avoid overmatching a number followed by .., we can use flex's trailing context operator. Here, we recognize a sequence of digits terminated by a . as a number only if the string is followed by something other than a .:

[[:digit:]]+                   { return INTEGER; }
  /* Change the final * to + so that this rule no longer
   * matches numbers ending with . */
[[:digit:]]*\.[[:digit:]]+     { return FLOAT; }
  /* Numbers which end with dot not followed by dot */
[[:digit:]]+\./[^.]            { return FLOAT; }

Now we need to fix the error rule. First, we limit it to recognizing strings where every dot is followed by a digit. Then, similar to the above, we also match the case where there is a trailing dot not followed by another dot:

[[:digit:]]*(\.[[:digit:]]+)+  { return BADNUMBER; }
[[:digit:]]*(\.[[:digit:]]+)+\./[^.] { return BADNUMBER; }
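As a sanity check on the final rule set, here is a small hand-written C function that should split input the same way. It is my own sketch rather than flex output: scan_token and the TOK_ names are invented for illustration. It also assumes that the trailing context /[^.] cannot match at end of input (there is no character left for [^.] to match), so a dot that is the very last character is not absorbed into the number.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; a real scanner would get these from the parser. */
enum token {
    TOK_NONE, TOK_DOT, TOK_RANGE, TOK_ELLIPSIS,
    TOK_INTEGER, TOK_FLOAT, TOK_BADNUMBER
};

/* Classify the token at the start of s and store its length in *len. */
enum token scan_token(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    size_t i = 0;      /* characters consumed so far */
    size_t dots = 0;   /* dots that ended up inside the numeric token */

    /* Leading digits, possibly none. */
    while (isdigit((unsigned char)s[i])) i++;

    /* Groups of the form \.[[:digit:]]+ : a dot is only taken
     * when a digit follows it, so ".." is never swallowed here. */
    while (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        dots++;
        while (isdigit((unsigned char)s[i])) i++;
    }

    /* Trailing dot, as in \./[^.] : taken only after something numeric,
     * and only if the next character exists and is not another dot. */
    if (i > 0 && s[i] == '.' && s[i + 1] != '.' && s[i + 1] != '\0') {
        i++;
        dots++;
    }

    if (i > 0) {
        *len = i;
        if (dots == 0) return TOK_INTEGER;
        if (dots == 1) return TOK_FLOAT;
        return TOK_BADNUMBER;          /* two or more dots */
    }

    /* Nothing numeric: try "...", "..", "." with the longest first. */
    size_t n = 0;
    while (n < 3 && s[n] == '.') n++;
    *len = n;
    switch (n) {
    case 3:  return TOK_ELLIPSIS;
    case 2:  return TOK_RANGE;
    case 1:  return TOK_DOT;
    default: return TOK_NONE;
    }
}
```

Calling it repeatedly on 4..17, advancing by the returned length each time, gives INTEGER, RANGE, INTEGER, while 12.34.56 comes back as a single BADNUMBER. Note that under the end-of-input assumption above, a final 4. with nothing after it comes out as INTEGER followed by '.', so input should normally end with a newline.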