Search code examples
compilationbisonflex-lexeryacclex

How can flex return multiple terminals at one time


In order to make my question easy to understand I want to use the following example:

The following code is called nonblock do-loop in fortran language

DO 20 I=1, N      ! line 1
DO 20 J=1, N      ! line 2
    ! more codes
20  CONTINUE      ! line 4

Pay attention that the label 20 at line 4 means the end of both the inner do-loop and the outer do-loop.

I want my flex program to parse the feature correctly: when flex reads the label 20, it will return ENDDO terminal twice.

Firstly, because I also use bison, so every time bison calls yylex() to get one terminal. If I can ask bison to get terminals from yylex() in some cases, and from another function in other cases, maybe I could solve this problem, however, I got no idea here then.

Of course there are some workarounds, for eample, I can use flex's start condition but I don't think it is a good solution. So I ask if there's any way to solve my question without a workaround?


Solution

  • It is easy enough to modify the lexical scanner produced by (f)lex to implement a token queue, but that is not necessarily the optimal solution. (See below for a better solution.) (Also, it is really not clear to me that for your particular problem, fabricating the extra token in the lexer is truly appropriate.)

    The general approach is to insert code at the top of the yylex function, which you can do by placing the code immediately after the %% line and before the first rule. (The code must be indented so that it is not interpreted as a rule.) For non-reentrant scanners, this will typically involve the use of a local static variable to hold the queue. For a simple but dumb example, using the C API but compiling with C++ so as to have access to the C++ standard library:

    %%
      /* This code will be executed each time `yylex` is called, before
       * any generated code. It may include declarations, even if compiled
       * with C89.
       */
      static std::deque<int> tokenq;
      if (!tokenq.empty()) {
        int token = tokenq.front();
        tokenq.pop_front();
        return token;
      }
    [[:digit:]]+  { /* match a number and return that many HELLO tokens */
                       int n = atoi(yytext);
                       for (int i = 0; i < n; ++i)
                         tokenq.push_back(HELLO);
                  }
    

    The above code makes no attempt to provide a semantic value for the queued tokens; you could achieve that using something like a std::queue<std::pair<int, YYSTYPE>> for the token queue, but the fact that YYSTYPE is typically a union will make for some complications. Also, if that were the only reason to use the token queue, it is obvious that it could be replaced with a simple counter, which would be much more efficient. See, for example, this answer which does something vaguely similar to your question (and take note of the suggestions in Note 1 of that answer).

    Better alternative: Use a push parser

    Although the token queue solution is attractive and simple, it is rarely the best solution. In most cases, code will be clearer and easier to write if you request bison to produce a "push parser". With a push parser, the parser is called by the lexer every time a token is available. This makes it trivial to return multiple tokens from a lexer action; you just call the parser for each token. Similarly, if a rule doesn't produce any tokens, it simply fails to call the parser. In this model, the only lexer action which actually returns is the <<EOF>> rule, and it only does so after calling the parser with the END token to indicate that parsing is complete.

    Unfortunately, the interface for push parsers is not only subject to change, as that manual link indicates; it is also very badly documented. So here is a simple but complete example which shows how it is done.

    The push parser keeps its state in a yypstate structure, which needs to be passed to the parser on each call. Since the lexer is called only once for each input file, it is reasonable for the lexer to own that structure, which can be done as above with a local static variable [Note 1]: the parser state is initialized when yylex is called, and the EOF rule deletes the parser state in order to reclaim whatever memory it is using.

    It is usually most convenient to build a reentrant push parser, which means that the parser does not rely on the global yylval variable [Note 2]. Instead, a pointer to the semantic value must be provided as an additional argument to yypush_parse. If your parser doesn't refer to the semantic value for the particular token type, you can provide NULL for this argument. Or, as in the code below, you can use a local semantic value variable in the lexer. It is not necessary that every call to the push parser provide the same pointer. In all, the changes to the scanner definition are minimal:

    %%
      /* Initialize a parser state object */
      yypstate* pstate = yypstate_new();
      /* A semantic value which can be sent to the parser on each call */
      YYSTYPE yylval;
      /* Some example scanner actions */
    "keyword"    {  /* Simple keyword which just sends a value-less token */
                    yypush_parse(pstate, TK_KEYWORD, NULL); /* See Note 3 */
                 }
    [[:digit:]]+ { /* Token with a semantic value */
                   yylval.num = atoi(yytext);
                   yypush_parse(pstate, TK_NUMBER, &yylval);
                 }
    "dice-roll"  { /* sends three random numbers */
                   for (int i = 0; i < 2; ++i) {
                     yylval.num = rand() % 6;
                     yypush_parse(pstate, TK_NUMBER, &yylval);
                 }
    <<EOF>>      { /* Obligatory EOF rule */
                   /* Send the parser the end token (0) */
                   int status = yypush_parse(pstate, 0, NULL);
                   /* Free the pstate */
                   yypstate_delete(pstate);
                   /* return the parser status; 0 is success */
                   return status;
                 }
    

    In the parser, not much needs to be changed at all, other than adding the necessary declarations: [Note 4]

    %define api.pure full
    %define api.push-pull push
    

    Notes

    1. If you were building a reentrant lexer as well, you would use the extra data section of the lexer state object instead of static variables.

    2. If you are using location objects in your parser to track source code locations, this also applies to yylloc.

    3. The example code does not do a good job of detecting errors, since it doesn't check return codes from the calls to yypush_parse. One solution I commonly use is some variant on the macro SEND:

      #define SEND(token) do {                              \
        int status = yypush_parse(pstate, token, &yylval);  \
        if (status != YYPUSH_MORE) {                        \
          yypstate_delete(pstate);                          \
          return status;                                    \
        }                                                   \
      } while (0)
      

      It's also possible to use a goto to avoid the multiple instances of the yypstate_delete and return. YMMV.

    4. You may have to modify the prototype of yyerror. If you are using locations and/or providing extra parameters to the push_parser, the location object and/or the extra parameters will also be present in the yyerror call. (The error string is always the last parameter.) For whatever reason, the parser state object is not provided to yyerror, which means that the yyerror function no longer has access to variables such as yych, which are now members of the yypstate structure rather than being global variables, so if you use these variables in your error reporting (which is not really recommended practice), then you will have to find an alternative solution.