So I'm currently writing a parser for Python with Flex and Bison. The lexer part was completed quite quickly. Here are the parts of the lexer that might be relevant to this question:
%{
#include <stdio.h>
// many define
#define TOKEN_IDENTIFIER 259
// many define
#define TOKEN_WHITE 294
// many define
#define TOKEN_KEYWORD_FROM 306
// many define
#define TOKEN_KEYWORD_IMPORT 311
// many define
int displayToken(int);
void yyerror(char *sp);
%}
numbers ((0|[1-9][0-9]*)(\.[0-9]+)?)
identifiers ([a-zA-Z\_][0-9a-zA-Z\_]*)
identation (^\t+)
whites (\ |\t|\r|\n)+
// many lines
%%
"from" { return displayToken(TOKEN_KEYWORD_FROM); }
// many lines
"import" { return displayToken(TOKEN_KEYWORD_IMPORT); }
// many lines
{identifiers} { return displayToken(TOKEN_IDENTIFIER); }
{identation} { return displayToken(TOKEN_IDENTATION); }
{whites} { return displayToken(TOKEN_WHITE); }
// many lines
. { yyerror("Unknown token!!!"); }
%%
int displayToken(int token) {
#ifdef DEBUG
// logic to help debug
#endif
return token;
}
Now my parser only tries to parse lines such as:
import math
# or
from math import sqrt
So here is my code:
%{
#pragma GCC diagnostic ignored "-Wimplicit-function-declaration"
#include <stdio.h>
#include <stdlib.h>
int yylex();
%}
%define parse.error verbose
%token TOKEN_KEYWORD_IMPORT TOKEN_KEYWORD_FROM TOKEN_IDENTIFIER TOKEN_WHITE
%start input
%%
input: import_def
| from_def TOKEN_WHITE import_def
;
import_def: import TOKEN_WHITE identifier;
from_def: from TOKEN_WHITE identifier;
import: TOKEN_KEYWORD_IMPORT {
printf("found import\n");
};
from: TOKEN_KEYWORD_FROM {
printf("found from\n");
};
identifier: TOKEN_IDENTIFIER {
printf("found an identifier\n");
};
%%
int main (int argc, char **argv) {
#ifdef YYDEBUG
yydebug = 1;
#endif
yyparse();
}
int yywrap(void)
{
return 1;
}
int yyerror(char *s) {
fprintf(stderr, "error: %s\n", s);
exit(1);
}
Afterwards, I run these commands:
flex python.l
bison python.y --debug
gcc lex.yy.c python.tab.c -lfl
./a.out
This starts my parser, into which I type one of the Python import/from lines from earlier. I always get something like this:
Starting
Starting parse
Entering state 0
Reading a token: from math import sqrt
Token ID 306 (TOKEN_KEYWORD_FROM)
Next token is token $undefined ()
error: syntax error, unexpected $undefined, expecting TOKEN_KEYWORD_IMPORT or TOKEN_KEYWORD_FROM
So I know that my lexer identifies the token properly, but for some reason the parser seems unaware of that token ID. I've tried to include the lex.yy.c into the python.y file, without success. I've tried to copy the #define lines into both files, without success either. I don't know what to do next.
The token numbers you have hardcoded into your lexer are not the ones that your parser is expecting. Hardcoding token numbers is a terrible idea; it is almost impossible to keep them synchronized between the lexer and the parser. If you got that style out of some kind of so-called "tutorial" or "guide", get rid of it and find a better one.
Bison is designed to make it easy for you to keep these token numbers in synch. Here's what you do:
Delete all of those #define lines from your lexer. Ask bison to generate a header file:
bison --defines --debug python.y
Use the generated header (which will be called python.tab.h) by putting this line in your lexer, just below #include <stdio.h>:
#include "python.tab.h"
Do not add that include to your parser. Only put it in the lexer.
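For example, the top of your lexer might then look something like this (a sketch based on the excerpt you posted; the rest of the file stays as it is):
%{
#include <stdio.h>
#include "python.tab.h"   /* token numbers generated by bison; replaces the #define list */
int displayToken(int);
void yyerror(char *sp);
%}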
Make sure that you recompile the lexer every time you recompile the parser, after the parser has created the new header file.
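In practice that just means running bison first, so the header is fresh before flex and gcc see it; for example, keeping your file names:
bison --defines --debug python.y
flex python.l
gcc lex.yy.c python.tab.c -lfl
./a.out
(Once you add %option noyywrap as described below, you no longer need -lfl either, since you already define your own main.)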
You can get rid of that #pragma line by forward-declaring yyerror. The recommended prototype is:
void yyerror(const char* msg);
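With that declaration in place, the prologue of python.y can shrink to something like this (a sketch based on the code you posted):
%{
#include <stdio.h>
int yylex(void);              /* the scanner that flex generates */
void yyerror(const char *msg); /* lets you drop the #pragma line */
%}
(#include <stdlib.h> was only there for exit(), which the next point suggests dropping anyway.)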
Unless you call yyerror yourself, there's no need for a return value; the bison-generated parser does not attempt to use yyerror's return value.
yyerror should not call exit(1); that's very unfriendly. If you just let it return, then yyparse() will clean itself up and return 1.
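So the whole function can become something like:
void yyerror(const char *msg) {
    fprintf(stderr, "error: %s\n", msg);
    /* no exit(); just return and let yyparse() report the failure */
}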
You don't need yywrap. Insert the following line into your lexer's definitions section (outside the %{ ... %} block):
%option noyywrap noinput nounput
The first option tells flex not to use the yywrap mechanism. The other two tell it not to generate the input() and unput() static functions, which will let you remove the #pragma from your lexer file (you didn't show it, but I suppose it is there somewhere).
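Concretely, the %option line sits next to (not inside) the %{ ... %} block, and once it is there you can also delete your yywrap() function from python.y:
%option noyywrap noinput nounput
%{
/* #includes and declarations as before */
%}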
It's not recommended for the lexer to call exit(1) either (which won't happen if you take that line out of yyerror). But you do need to return something from the fallback action. You could return 0, which is equivalent to an EOF, but it's generally better to add to your %token list an unused token with a suggestive name (INVALID_CHARACTER, for example). Then in your lexer you can return INVALID_CHARACTER and you'll get a sensible error message from the parser, without having to invoke yyerror yourself.
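A sketch of what that could look like, keeping your existing names (INVALID_CHARACTER is just a suggestion; pick whatever reads well in the error messages):
In python.y:
%token TOKEN_KEYWORD_IMPORT TOKEN_KEYWORD_FROM TOKEN_IDENTIFIER TOKEN_WHITE INVALID_CHARACTER
In python.l:
. { return displayToken(INVALID_CHARACTER); }
Since you already have %define parse.error verbose, the parser will then report something like "unexpected INVALID_CHARACTER" instead of the lexer bailing out.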