Search code examples
cregexcompiler-constructionflex-lexerlexical-analysis

Lexical analyzer (Flex) throws lexical error after space following by tokens (and more)


As i run the lexical analyzer with the following example, it seems like it cannot recognize empty space between tokens, and tokens generated with the regex showed down.

lexical analyzer (Something.l) :

%{
#include <stdio.h>
#include <stdlib.h>

int yylex();
//void yyerror(const char *s);
void yyerror (const char * msg)
{
  fprintf(stderr, "C-like : %s\n", msg);
  exit(1);
}
int line_num = 1;

#include "y.tab.h"
#define T_eof   0
%}


%option noyywrap 


letter  [A-Za-z]
digit   [0-9]
id  letter(letter|digit|'_')*
num [1-9]digit*('.'digit*)?
string  '(digit|letter)*'
Empty   [\t\r]|" "
line    [\n]

%%

{line}      { line_num++ ; }
"mainclass" { printf("MAINCLASS ") ; return  (MAINCLASS) ; }
"public"    { printf("PUBLIC ") ; return (PUBLIC); }
"static"    { printf("STATIC ") ; return (STATIC) ; }
"void"      { printf("VOID ") ; return (VOID) ; }
"main"      { printf("MAIN ") ; return (MAIN) ; }
"println"   { printf("PRINTLN ") ; return (PRINTLN) ; }   
"int"       { printf("INT ") ; return (INT) ; }
"float"     { printf("FLOAT ") ; return (FLOAT) ; }
"for"       { printf("FOR ") ; return (FOR) ; }
"while"     { printf("WHILE ") ; return (WHILE) ; }
"if"        { printf("IF ") ; return (IF) ; }
"else"      { printf("ELSE ") ; return (ELSE) ; }
";"     { printf("Q ") ; return (Q) ; }
"=="        { printf("EQUAL ") ; return (EQUAL) ; }
"<="        { printf("SMALLEReq ") ; return (SMALLER) ; }
">="        { printf("BIGGEReq ") ; return (BIGGER) ; }
"!="        { printf("NOTEQUAL ") ; return (NOTEQUAL) ; }
{id}        { printf("ID ") ; return (ID) ; }
{num}       { printf("NUM ") ; return (NUM) ; }
{string}    { printf("STRING ") ; return (STRING) ; }
<<EOF>>     { printf("EOF ") ; return (EOF); }
.       { printf(" lexical error in Line : %d \n ", line_num); exit(1); }
{Empty}+    { printf("EMPTY ") ; /* nothing */ }
[\(\)\{\}]  { return yytext[0] ; }
%%
int main(){
    yylex();
    return 0;
}

Running the following example :

mainclass Fibonacci {
    public static void main ( )
    {
        int first, second, i, tmp;
        first=0;
        second=1;
        i=0;
        while (i<10)
        {
            i=i+1;
            tmp=first+second;
            println (tmp);
            first=second;
            second=tmp;
        }
    }
}

Output :

MAINCLASS

Adding one space in the beggining of the example :

lexical error in Line : 1

Adding tow or more spaces in the beggining of the example, we have got this output :

EMPTY MAINCLASS

Removing mainclass and leave the first space and the identifier in the beggining we have got :

EMPTY  lexical error in Line : 1

Solution

  • The order of rules in a flex file matters, and this is a particular example of that.

    You have the rule:

    .       { printf(" lexical error in Line : %d \n ", line_num); exit(1); }
    

    which will match any single character, which includes matching a space character.

    Later on in the file (actually, immediately following), you have

    {Empty}+    { printf("EMPTY ") ; /* nothing */ }
    

    which could also match a single space character. But if the token to be matched is a single space, then the first rule will win (precisely because when the same token is matched by more than one rule, the first rule wins).

    On the other hand, if there are two spaces, then the pattern {Empty}+ will match both of them, whereas . will match only one. In this case, {Empty}+ will win, because the longest match always wins.

    You should always place fallback rules at the very end of your scanner description (except possibly for <<EOF>> rules). That not only ensures that they will work as expected, it also put the rules where people will look for them.

    Note that there are a variety of other errors in your macro definitions, some of which are noted in a comment. These will also cause your scanner to reject valid inputs, so they need to be addressed.

    By and large, I recommend avoiding macros unless you have very complicated patterns in which the same subpattern appears several times, which is the use case they were intended for. {Empty} is not at all descriptive, so it forces the code reader (me, in this case) to search your source file for a definition. You could have used the Posix character class [[:space:]], which is well-known to anyone experienced with Flex. (It includes newline characters, but your newline rule is only used to increment the line number count; you could get Flex to do that for you by simply including %option yylineno.)