Can't get bison to identify a token: always `$undefined`


I'm currently writing a parser for Python with Flex and Bison. The lexer was finished quite quickly. Here are the parts of the lexer that seem relevant to this question:

%{
#include <stdio.h>

// many define
#define TOKEN_IDENTIFIER 259
// many define
#define TOKEN_WHITE 294
// many define
#define TOKEN_KEYWORD_FROM 306
// many define
#define TOKEN_KEYWORD_IMPORT 311
// many define

int displayToken(int);
void yyerror(char *sp);
%}

numbers         ((0|[1-9][0-9]*)(\.[0-9]+)?)
identifiers     ([a-zA-Z\_][0-9a-zA-Z\_]*)
identation      (^\t+)
whites          (\ |\t|\r|\n)+

%%
// many lines
"from"          { return displayToken(TOKEN_KEYWORD_FROM); }
// many lines
"import"        { return displayToken(TOKEN_KEYWORD_IMPORT); }
// many lines

{identifiers}   { return displayToken(TOKEN_IDENTIFIER); }
{identation}    { return displayToken(TOKEN_IDENTATION); }
{whites}        { return displayToken(TOKEN_WHITE); }

// many lines
.               { yyerror("Unknown token!!!"); }
%%

int displayToken(int token) {

#ifdef DEBUG

// logic to help debug

#endif

    return token;
}

For now, my parser only tries to parse lines such as:

import math
# or
from math import sqrt

Here is the parser:

%{
    #pragma GCC diagnostic ignored "-Wimplicit-function-declaration"
    #include <stdio.h>
    #include <stdlib.h>

    int yylex();
%}

%define parse.error verbose

%token TOKEN_KEYWORD_IMPORT TOKEN_KEYWORD_FROM TOKEN_IDENTIFIER TOKEN_WHITE
%start input

%%
input: import_def
     | from_def TOKEN_WHITE import_def
;

import_def: import TOKEN_WHITE identifier;

from_def: from TOKEN_WHITE identifier;

import: TOKEN_KEYWORD_IMPORT {
    printf("found import\n");
};

from: TOKEN_KEYWORD_FROM {
    printf("found from\n");
};

identifier: TOKEN_IDENTIFIER {
    printf("found an identifier\n");
};

%%

int main (int argc, char **argv) {
    #ifdef YYDEBUG
    yydebug = 1;
    #endif
    yyparse();
}

int yywrap(void)
{
   return 1;
}

int yyerror(char *s) {
    fprintf(stderr, "error: %s\n", s);
    exit(1);
}

Then I run these commands:

flex python.l
bison python.y --debug
gcc lex.yy.c python.tab.c -lfl
./a.out

This starts my parser, which waits for input; I then type one of the Python import/from lines shown earlier. I always get something like this:

Starting
Starting parse
Entering state 0
Reading a token: from math import sqrt
Token ID 306 (TOKEN_KEYWORD_FROM)
Next token is token $undefined ()
error: syntax error, unexpected $undefined, expecting TOKEN_KEYWORD_IMPORT or TOKEN_KEYWORD_FROM

So I know that my lexer identifies the token properly, but for some reason the parser seems unaware of that token ID. I've tried including lex.yy.c in the python.y file, without success. I've also tried copying the defines into both files, without success either. I don't know what to do next.


Solution

  • The token numbers you have hardcoded into your lexer are not the ones that your parser is expecting. Unless told otherwise, bison assigns its own numbers to the tokens named in %token, starting at 258 in declaration order, so the parser has no token numbered 306 and reports it as $undefined. Hardcoding token numbers is a terrible idea; it is almost impossible to keep them synchronized between the lexer and the parser. If you got that style out of some kind of so-called "tutorial" or "guide", get rid of it and find a better one.

    Bison is designed to make it easy for you to keep these token numbers in sync. Here's what you do:

    1. Delete all of those #define lines from your lexer. Ask bison to generate a header file:

      bison --defines --debug python.y
      
    2. Use the generated header (which will be called python.tab.h) by putting this line in your lexer just below #include <stdio.h>:

      #include "python.tab.h"
      

      Do not add that include to your parser. Only put it in the lexer.

    3. Make sure that you recompile the lexer every time you recompile the parser, after the parser has created the new header file.
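
To see why the parser printed $undefined, it may help to look at roughly what the generated header would contain. The sketch below mirrors the %token line from the question and assumes bison's default numbering (named tokens numbered from 258 in declaration order); the real python.tab.h will differ in detail. token_name and HARDCODED_KEYWORD_FROM are hypothetical names used only for illustration.

```c
/* Roughly what bison would emit in python.tab.h for
 *   %token TOKEN_KEYWORD_IMPORT TOKEN_KEYWORD_FROM TOKEN_IDENTIFIER TOKEN_WHITE
 * assuming default numbering, which starts at 258. */
enum yytokentype {
    TOKEN_KEYWORD_IMPORT = 258,
    TOKEN_KEYWORD_FROM   = 259,
    TOKEN_IDENTIFIER     = 260,
    TOKEN_WHITE          = 261
};

/* The value the question's lexer hardcoded for "from" (hypothetical name). */
#define HARDCODED_KEYWORD_FROM 306

/* Hypothetical helper: returns the name the parser knows for a token code,
 * or "$undefined" when the code is not one the parser declared. */
const char *token_name(int code) {
    switch (code) {
    case TOKEN_KEYWORD_IMPORT: return "TOKEN_KEYWORD_IMPORT";
    case TOKEN_KEYWORD_FROM:   return "TOKEN_KEYWORD_FROM";
    case TOKEN_IDENTIFIER:     return "TOKEN_IDENTIFIER";
    case TOKEN_WHITE:          return "TOKEN_WHITE";
    default:                   return "$undefined";
    }
}
```

Under these assumptions the lexer's 306 matches none of the parser's tokens, whereas the TOKEN_KEYWORD_FROM taken from the header always does, whatever number bison happens to pick.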

    Additional suggestions

    You can get rid of that #pragma line by forward-declaring yyerror. The recommended prototype is:

    void yyerror(const char* msg);
    

    Unless you call yyerror yourself, there's no need for a return value; the bison-generated parser does not attempt to use yyerror's return value.

    yyerror should not call exit(1); that's very unfriendly. If you just let it return, then yyparse() will clean itself up and return 1.
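
As a rough sketch of that advice, a yyerror that only reports the message and then returns could look like the following. syntax_error_count and exit_status_for are hypothetical helpers, not part of bison; the only thing assumed about yyparse() is that it returns 0 on success and a nonzero value on failure.

```c
#include <stdio.h>

/* Hypothetical counter so the caller can see how many errors were reported. */
static int syntax_error_count = 0;

/* Report the message and return; the generated parser ignores any result. */
void yyerror(const char *msg) {
    fprintf(stderr, "error: %s\n", msg);
    syntax_error_count++;
}

/* Hypothetical helper: turn yyparse()'s result (0 on success, nonzero on
 * failure) into a process exit status. */
int exit_status_for(int parse_result) {
    return parse_result == 0 ? 0 : 1;
}
```

In the parser's main, this could be used as return exit_status_for(yyparse()); so the program still exits with a failure status, but only after yyparse() has returned.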

    You don't need yywrap. Insert the following into the definitions section of your lexer (outside the %{ ... %} block):

    %option noyywrap noinput nounput
    

    The first option tells flex not to use the yywrap mechanism. The other two tell it not to generate the input() and unput() static functions, which will let you remove the #pragma from your lexer file (you didn't show it, but I suppose it is there somewhere).

    It's not recommended for the lexer to call exit(1) either (which won't happen if you take that line out of yyerror). But you do need to return something from the fallback action. You could return 0, which is equivalent to an EOF, but it's generally better to add to your %token list an unused token with a suggestive name (INVALID_CHARACTER, for example). Then in your lexer you can return INVALID_CHARACTER and you'll get a sensible error message from the parser, without having to invoke yyerror yourself.