I'm writing a sort of a script language interpreter. I would like it to be able to handle (ignore) things like shebang, utf-bom, or other such thing that can appear on the beginning of a file.
The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.) Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.
Let's focus on a shebang in the example. I've written some simple grammar that illustrates the problems I'm facing.
Lexer:
%%
#!.+ { printf("shebang: \"%s\"\n", yytext + 2); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Input file:
#!my-program
# some multiline
thingy #
aaa bbb
ccc#!not a shebang#ddd
eee
Expected output:
shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Actual output:
thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb
ccc#"
error: '!'
id: not
id: a
id: shebang
error: '#'
id: ddd
id: eee
I figured that this is a good case to use start conditions. I managed to use them to write a lexer that does work, however, it's rather ugly:
%s MAIN
%%
<INITIAL>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Notice that I had to specify the start condition MAIN
before the rule #[^#]*#
.
It's because it would otherwise collide with the shebang rule #!.+
.
Unfortunately, the INITIAL
start condition is inclusive, which means I had to specifically exclude from it any rule that would cause problems. I have to remember about it every time I write a new rule (AKA I'll forget about it).
Is there some way to make the INITIAL
exclusive or choose a different start condition to be the default?
Here's a simpler solution, assuming you're using Flex (as per your flex-lexer tag):
%option noinput nounput noyywrap nodefault yylineno
%{
#define YY_USER_INIT BEGIN(STARTUP);
%}
%x STARTUP
%%
<STARTUP>#!.* { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
<STARTUP>.|\n { BEGIN(INITIAL); yyless(0); }
/* Rest is INITIAL */
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
rici$ flex -o shebang.c shebang.l
rici$ gcc -Wall -o shebang shebang.c -lfl
rici$ ./shebang <<"EOF"
> #!my-program
> # some multiline
> thingy #
> aaa bbb
> ccc#!not a shebang#ddd
> eee
> EOF
Shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
%option
line:
yywrap
;yylineno
YY_USER_INIT
is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.yyless(0)
causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)-lfl
includes yywrap()
(although in this case, it's not used), and a simple main()
definition rather similar to the one in your example.(1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to Posix, with the exception that the Posix lex libary is linked with -ll
.