
Use yylex() to get the list of token types from an input string


I have a CLI that was made using Bison and Flex which has grown large and complicated, and I'm trying to get the complete sequence of tokens (yytokentype or the corresponding yytranslate Bison symbol numbers) for a given input string to the parser.

Ideally, every time yyerror() is called, I want to store the sequence of tokens that were identified during the parse. I don't need the yylval values, states, actions, etc., just the list of tokens produced from the string fed to the buffer.

If a straightforward way of doing this doesn't exist, then just a stand-alone way of going from string --> yytokentypes will work.

The code below only has debugging printouts; I'll change it to store the tokens where I want them as soon as I figure out how to get them.

// When an error condition is reached, yylex() to get the yytokentypes
void yyerror(const char *s)
{
    std::cerr<<"LEX\n";
    int tok; // yytokentype
    do
    {
        tok = yylex();
        std::cerr<<tok<<",";
    }while(tok);
    std::cerr<<"LEX\n";
}

Solution

  • A simpler solution is to just change the name of the lexer using the YY_DECL macro and then add a definition of yylex at the end:

    %{
    // ...
    #include "parser.tab.h"
    #define YY_DECL static int wrapped_lexer(void)
    %}
    
    %%
      /* rules */
    %%
    int yylex(void) {
      int token = wrapped_lexer();
      /* do something with the token */
      return token;
    }
    

    Having said that, unless the source code is read-once for some reason, it's probably faster on the whole to rescan the input only if an error is encountered rather than saving the token list in case an error is encountered. Lexing is really pretty fast, and in many use cases, syntactically correct inputs are more common than erroneous ones.
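To make the "do something with the token" step concrete, here is a minimal sketch of a recording wrapper, written as plain C++ so it can be tried without flex. Everything beyond `yylex`, `yyerror` and `wrapped_lexer` is an assumption made for illustration: the toy `wrapped_lexer` body, the token codes `TOK_WORD` and `TOK_NUMBER`, and the helpers `reset_token_log` and `token_log` are hypothetical. In a real project, `wrapped_lexer` would be the flex-generated scanner renamed through `YY_DECL`, and the token codes would come from `parser.tab.h`.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Made-up token codes; a real project takes these from parser.tab.h.
enum { TOK_EOF = 0, TOK_WORD = 258, TOK_NUMBER = 259 };

static std::istringstream g_input;  // stand-in for the flex input buffer
static std::vector<int> g_tokens;   // tokens returned since the last reset

// Hypothetical stand-in for the flex scanner renamed via YY_DECL:
// one token per whitespace-separated word, 0 at end of input.
static int wrapped_lexer() {
    std::string word;
    if (!(g_input >> word)) return TOK_EOF;
    bool all_digits = true;
    for (char c : word) {
        if (c < '0' || c > '9') all_digits = false;
    }
    return all_digits ? TOK_NUMBER : TOK_WORD;
}

// The function the parser calls: delegate, record, return.
int yylex() {
    int token = wrapped_lexer();
    g_tokens.push_back(token);
    return token;
}

// Call before each parse so the log covers only the current input.
void reset_token_log(const std::string& text) {
    g_input.str(text);
    g_input.clear();  // drop any EOF flag left over from the previous input
    g_tokens.clear();
}

// Comma-separated token codes recorded so far.
std::string token_log() {
    std::ostringstream out;
    for (std::size_t i = 0; i < g_tokens.size(); ++i) {
        if (i != 0) out << ',';
        out << g_tokens[i];
    }
    return out.str();
}

// On a syntax error, report the tokens the parser has consumed so far.
void yyerror(const char* s) {
    std::cerr << s << " after tokens: " << token_log() << '\n';
}
```

With this shape, `yyerror` sees every token handed to the parser up to and including the lookahead that triggered the error, but nothing after it. If the full token sequence for the input is wanted, the rescanning approach described above fits better; with flex that would typically mean pointing the scanner at the string again with `yy_scan_string()` and calling the lexer until it returns 0.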