flex-lexer lex

Match a rule only at the beginning of a file


Problem

I'm writing an interpreter for a sort of scripting language. I would like it to be able to handle (that is, ignore) things like a shebang line, a UTF BOM, or anything else of that kind that can appear at the beginning of a file.

The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.) Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.

Let's focus on the shebang as an example. I've written a simple grammar that illustrates the problem I'm facing.

Lexer:

%%

#!.+            { printf("shebang: \"%s\"\n", yytext + 2); }

[[:alnum:]_]+   { printf("id: %s\n", yytext); return 1; }
#[^#]*#         { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]]     ;   
.               { printf("error: '%c'\n", yytext[0]); }

%%

int main() { while (yylex()); return 0; }
int yywrap() { return 1; }

Input file:

#!my-program
# some multiline
thingy #
aaa bbb 
ccc#!not a shebang#ddd
eee

Expected output:

shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa 
id: bbb 
id: ccc 
thingy: "#!not a shebang#"
id: ddd 
id: eee

Actual output:

thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb 
ccc#"
error: '!' 
id: not 
id: a
id: shebang
error: '#' 
id: ddd 
id: eee

My (bad?) solution

I figured that this was a good case for start conditions. I managed to use them to write a lexer that works; however, it's rather ugly:

%s MAIN

%%

<INITIAL>#!.+   { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }

[[:alnum:]_]+   { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*#   { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]]     ;   
.               { printf("error: '%c'\n", yytext[0]); }

%%

int main() { while (yylex()); return 0; }
int yywrap() { return 1; }

Notice that I had to put the start condition MAIN in front of the rule #[^#]*#, because it would otherwise collide with the shebang rule #!.+. Unfortunately, the INITIAL start condition is inclusive, which means I have to explicitly exclude from it every rule that could cause problems. I have to remember to do that every time I write a new rule (in other words, I'll forget).

Is there some way to make the INITIAL exclusive or choose a different start condition to be the default?


Solution

  • Here's a simpler solution, assuming you're using Flex (as per your tag):

    %option noinput nounput noyywrap nodefault yylineno
    
    %{
    #define YY_USER_INIT BEGIN(STARTUP);
    %}
    %x STARTUP
    
    %%
    <STARTUP>#!.*    { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
    <STARTUP>.|\n    { BEGIN(INITIAL); yyless(0); }
      /* Rest is INITIAL */
    [[:alnum:]_]+   { printf("id: %s\n", yytext); return 1; }
    #[^#]*#         { printf("thingy: \"%s\"\n", yytext); return 2; }
    [[:space:]]     ;   
    .               { printf("error: '%c'\n", yytext[0]); }
    

    Test:

    rici$ flex -o shebang.c shebang.l
    rici$ gcc -Wall -o shebang shebang.c -lfl
    rici$ ./shebang <<"EOF"
    > #!my-program
    > # some multiline
    > thingy #
    > aaa bbb 
    > ccc#!not a shebang#ddd
    > eee
    > EOF
    Shebang: "my-program"
    thingy: "# some multiline
    thingy #"
    id: aaa
    id: bbb
    id: ccc
    thingy: "#!not a shebang#"
    id: ddd
    id: eee
    

    Notes:

    1. The %option line:
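      • noinput and nounput prevent "unused function" warnings;
      • noyywrap removes the need for yywrap;
      • nodefault reports an error if some possible input doesn't match any pattern, instead of falling back to the default rule, which echoes unmatched characters;
      • yylineno counts input lines in the global yylineno.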
    2. The macro YY_USER_INIT is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.
    3. yyless(0) causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)
    4. The library -lfl includes yywrap() (although in this case, it's not used), and a simple main() definition rather similar to the one in your example.

    (1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to POSIX, except that the POSIX lex library is linked with -ll.
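
The question also mentions a UTF BOM. As a rough sketch (not the scanner tested above), the same STARTUP idea could probably be extended to skip a UTF-8 BOM, the byte sequence EF BB BF, before the optional shebang. The second start condition AFTER_BOM is a name invented here for illustration, and the sketch assumes an 8-bit scanner (requested explicitly with %option 8bit) reading UTF-8 input:

```lex
%option noinput nounput noyywrap nodefault 8bit

%{
#include <stdio.h>
/* Start scanning in STARTUP rather than INITIAL. */
#define YY_USER_INIT BEGIN(STARTUP);
%}

%x STARTUP AFTER_BOM

%%
<STARTUP>\xEF\xBB\xBF       { /* UTF-8 BOM: skip it, but still allow a shebang */ BEGIN(AFTER_BOM); }
<STARTUP,AFTER_BOM>#!.*     { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext + 2); }
<STARTUP,AFTER_BOM>.|\n     { /* anything else: rescan it as ordinary input */ BEGIN(INITIAL); yyless(0); }

[[:alnum:]_]+   { printf("id: %s\n", yytext); return 1; }
#[^#]*#         { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]]     ;
.               { printf("error: '%c'\n", yytext[0]); }
%%

int main(void) { while (yylex()); return 0; }
```

AFTER_BOM (hypothetical name) exists only so that at most one BOM is skipped; a shebang is still accepted right after it, and any other character hands control to INITIAL exactly as before. Because this sketch defines its own main() and uses noyywrap, it should not need -lfl.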