I am writing a flex program to deal with string constants.
I want to return an ERROR token when the input reaches EOF inside a string.
After EOF is reached and "ERROR" is printed, I get the following error:
fatal flex scanner internal error--end of buffer missed
Here is my code: (a simplified version which can reproduce this error)
%option noyywrap
%{
#include <stdio.h>
#define ERROR 300
#define STRING 301
char *text;
%}
%x str
%%
\" {BEGIN(str); yymore();}
<str>\" {BEGIN(INITIAL); text=yytext; return STRING;}
<str>. {yymore();}
<str><<EOF>> {BEGIN(INITIAL); return ERROR;}
%%
int main(){
    int token;
    while((token=yylex())!=0){
        if(token==STRING)
            printf("string:%s\n",text);
        else if(token==ERROR)
            printf("ERROR\n");
    }
    return 0;
}
When I delete the yymore() function call, the error disappears and the program exits normally after printing "ERROR".
I wonder why this happens, and I want to solve it without removing yymore().
You cannot continue the lexical scan after you receive an EOF indication, so your <str><<EOF>> rule is incorrect, and that is what the error message indicates.
As with any undefined behaviour, there are circumstances in which the error may lead to arbitrary behaviour, including working as you incorrectly assumed it would work. (With your flex version, this happens if you don't use yymore, for example.)
You need to ensure that the scanner loop is not reentered after the EOF is received. You could, for example, return an error code which indicates that no more tokens can be read (as opposed to a restartable error indication, if one is needed). Or you could set a flag for the lexer which causes it to immediately return 0 after an unrecoverable error.
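A minimal sketch of the first strategy, for comparison. It assumes a hypothetical FATAL_ERROR token code (302 is an arbitrary choice) which the <str><<EOF>> rule would return instead of ERROR. The next_token stub is only a stand-in for the generated yylex so that the loop compiles on its own; the point is that the caller stops asking for tokens once it sees the fatal code.

```c
#include <assert.h>
#include <stdio.h>

#define ERROR       300  /* recoverable error: scanning may continue */
#define STRING      301
#define FATAL_ERROR 302  /* hypothetical: no more tokens can be read */

/* Stand-in for yylex(): replays a fixed, 0-terminated token sequence. */
static const int *stub_tokens;
static int next_token(void) {
    return *stub_tokens ? *stub_tokens++ : 0;
}

/* Driver loop: returns the number of tokens consumed, and never
 * calls the scanner again once FATAL_ERROR has been seen. */
static int drive(const int *tokens) {
    int token, consumed = 0;
    assert(tokens != NULL);
    stub_tokens = tokens;
    while ((token = next_token()) != 0) {
        consumed++;
        if (token == STRING)
            printf("string\n");
        else if (token == ERROR)
            printf("ERROR\n");
        else if (token == FATAL_ERROR) {
            printf("ERROR (EOF inside string)\n");
            break;  /* do not re-enter the scanner */
        }
    }
    return consumed;
}
```

In the real program, next_token() would be yylex(), and the only change to the rules would be returning FATAL_ERROR from the <str><<EOF>> action.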
Here's an example of the second strategy (just the rules, since nothing else changes):
%%
  /* Indented code before the first pattern is inserted
   * at the beginning of yylex, allowing declaration of
   * variables. The fatal_error flag is declared static,
   * since this is not a reentrant lexer. If it were
   * reentrant, we'd put the flag in the lexer context
   * (as part of the "extra data"), which would be a lot cleaner.
   */
  static int fatal_error = 0;
  /* If the error we last returned was fatal, we do
   * not re-enter the scanner loop; we just return EOF
   */
  if (fatal_error) {
    fatal_error = 0; /* reset the flag for reuse */
    return 0;
  }
\" {BEGIN(str); yymore();}
<str>\" {BEGIN(INITIAL); text=yytext; return STRING;}
<str>. {yymore();}
<str><<EOF>> {BEGIN(INITIAL);
              fatal_error = 1; /* Set the fatal error flag */
              return ERROR;}
%%
Another possible solution is to use a "push parser", where yylex calls the parser with each token, instead of the other way round. bison supports this style, and it's often a lot more convenient; in particular, it allows an action to send more than one token to the parser, which in this case would obviate the need for a static local flag.
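The push idea can be illustrated without bison. The sketch below is an assumption-laden stand-in, not generated code: scan is a hand-written replacement for the flex rules, and token_sink, record and struct token_log are hypothetical names. With a bison push parser, the role of the sink would be played by a call to yypush_parse. What it shows is that the EOF-inside-string case can deliver both the ERROR token and the end-of-input token from a single action, so no flag has to survive between calls.

```c
#include <assert.h>
#include <stddef.h>

#define ERROR  300
#define STRING 301
#define END    0     /* end-of-input token */

/* Hypothetical token consumer; a push parser would sit behind this. */
typedef void (*token_sink)(int token, void *ctx);

/* Hand-written stand-in for the flex rules: recognizes "..."
 * strings and skips everything outside them. */
static void scan(const char *p, token_sink sink, void *ctx) {
    while (*p) {
        if (*p++ != '"')
            continue;            /* outside a string: skip */
        while (*p && *p != '"')
            p++;                 /* inside a string: consume */
        if (*p == '\0') {
            /* EOF inside a string: send both tokens at once */
            sink(ERROR, ctx);
            sink(END, ctx);
            return;
        }
        p++;                     /* step over the closing quote */
        sink(STRING, ctx);
    }
    sink(END, ctx);
}

/* Example consumer: records the tokens it receives, in order. */
struct token_log { int tokens[8]; size_t n; };
static void record(int token, void *ctx) {
    struct token_log *log = ctx;
    assert(log->n < 8);          /* sketch only: fixed capacity */
    log->tokens[log->n++] = token;
}
```

Calling scan on an input that ends inside a string hands the consumer ERROR followed immediately by END, and the scanner is never entered again.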