I'm writing a parser which has some tokens that are concatenated from multiple smaller rules, using yymore().
If it reaches EOF before the end of this composite token, I need it to return a special error-token to the parser. This is the same problem as in this question.
The answer there suggests to convert the parser to a "push parser" to solve this.
The Bison manual makes it pretty clear how to write the push-parser part, but I cannot find similar instructions for what the lexer should look like.
Let's take the following lexer:
%option noyywrap
%{
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* free */
#include <string.h>  /* strdup */
// Stub of the parser header file:
#define GOOD_STRING 1000
#define BAD_STRING 1001
char *yylval;
%}
%x STRING
%%
\" { BEGIN(STRING); yymore(); }
<STRING>{
\" { BEGIN(INITIAL); yylval = strdup(yytext); return GOOD_STRING; }
.|\n { yymore(); }
<<EOF>> { BEGIN(INITIAL); yylval = strdup(yytext); return BAD_STRING; }
}
.|\n { return yytext[0]; }
%%
void parser_stub()
{
    int token;
    while ((token = yylex()) > 0) {
        if (token < 1000) {
            printf("%i '%c'\n", token, (char)token);
        } else {
            printf("%i \"%s\"\n", token, yylval);
            free(yylval);
        }
    }
}
int main(void)
{
    parser_stub();
}
It doesn't work as a pull parser because yylex() gets called again after it has already hit EOF, which ends in an error: fatal flex scanner internal error--end of buffer missed.
(It works if yymore() is not used, but it is still technically undefined behavior.)
In the <<EOF>> rule it needs to emit two tokens: BAD_STRING and 0.
How do you convert a lexer into one suitable for a push parser?
I'm guessing it involves replacing the returns with something that pushes a token to the parser without ending yylex(), but I haven't found any mention of such a function or macro.
Is this just a case of having to implement it manually, without any built-in support in Flex?
Instead of setting yylval and returning the token id, you call the push parser with the token id and semantic value as arguments. That's it. Flex provides nothing to help you write the return or the call, but that doesn't really seem like a big deal to me.
Sometimes it's convenient to use your own macro in both cases, to reduce boilerplate. I tend to do that. For push parsing, the main issue is error handling, since the push parser might return an error on any call.
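To make the shape of that concrete, here is a rough sketch in plain C. It is not Flex or Bison output: stub_push, stub_parser, SEND, scan_and_push, copy_span and the PUSH_* codes are all made-up names for illustration. stub_push plays the role that Bison's yypush_parse() would play (called on a yypstate, with the result compared against YYPUSH_MORE), and the hand-written loop stands in for the Flex rules from the question. The point is only that an action can push BAD_STRING and then the end-of-input token without ever returning in between, and that the macro leaves the scanner as soon as the parser reports anything other than "send more".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token ids; with Bison these come from the generated header. */
enum { TOK_EOF = 0, GOOD_STRING = 1000, BAD_STRING = 1001 };

/* Hypothetical status codes, playing the role of Bison's YYPUSH_MORE etc. */
enum { PUSH_DONE = 0, PUSH_ERROR = 1, PUSH_MORE = 4 };

#define MAX_TOKENS 16

/* Stub push parser: it only records what it is given. */
typedef struct {
    int   kinds[MAX_TOKENS];
    char *values[MAX_TOKENS];   /* owned by the parser once pushed */
    int   count;
} stub_parser;

/* Stand-in for yypush_parse(): takes one token, says whether it wants more. */
int stub_push(stub_parser *ps, int kind, char *value)
{
    if (ps->count >= MAX_TOKENS) {
        free(value);
        return PUSH_ERROR;
    }
    ps->kinds[ps->count] = kind;
    ps->values[ps->count] = value;
    ps->count++;
    return kind == TOK_EOF ? PUSH_DONE : PUSH_MORE;
}

/* Release the semantic values the stub parser has taken ownership of. */
void stub_free(stub_parser *ps)
{
    for (int i = 0; i < ps->count; i++)
        free(ps->values[i]);
    ps->count = 0;
}

/* Copy a span of the input so the parser never sees the scanner's buffer. */
char *copy_span(const char *start, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, start, len);
        copy[len] = '\0';
    }
    return copy;
}

/* Push one token; leave the scanner as soon as the parser stops asking for more. */
#define SEND(ps, kind, value)                               \
    do {                                                    \
        int status_ = stub_push((ps), (kind), (value));     \
        if (status_ != PUSH_MORE) return status_;           \
    } while (0)

/* Hand-written stand-in for the Flex rules in the question. */
int scan_and_push(stub_parser *ps, const char *input)
{
    assert(ps != NULL && input != NULL);
    size_t i = 0;
    while (input[i] != '\0') {
        if (input[i] != '"') {
            /* Any other character is its own token, with no semantic value. */
            SEND(ps, (unsigned char)input[i], NULL);
            i++;
            continue;
        }
        const char *close = strchr(input + i + 1, '"');
        if (close == NULL) {
            /* Unterminated string: push the error token, then fall
               through to the end-of-input token below. */
            SEND(ps, BAD_STRING, copy_span(input + i, strlen(input + i)));
            break;
        }
        size_t len = (size_t)(close - (input + i)) + 1;
        SEND(ps, GOOD_STRING, copy_span(input + i, len));
        i += len;
    }
    return stub_push(ps, TOK_EOF, NULL);
}
```

Calling scan_and_push(&ps, "a\"bc") on a zero-initialised stub_parser records three tokens: 'a', then BAD_STRING carrying the text "bc (with its opening quote), then the end-of-input token. In a real Flex file the same SEND-style macro would sit in the rule actions in place of return, and whoever calls yylex() has to look at the status it hands back.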
By the way, even in stripped-down example code, you should never pass yytext directly to the parser. That's probably the number one cause of mysterious lexing bugs. Also, Bison assigns token numbers for you. Putting your own definitions in the lexer implementation will almost inevitably lead to grief.
Here are some examples of answers I've written over the years with push lexers:
https://stackoverflow.com/a/63000285/1566221 (with a complex send macro)
This example uses Flex with Lemon (another push parser):
There are probably more :-)