yyllocp->first_line returns uninitialized value in second iteration of a reentrant Bison parser


I have a reentrant parser which takes its input from a string and uses a structure to maintain context. A function is called with different input strings to be parsed. The relevant code of that function is:

int parseMyString(char *inputToBeParsed) {

 //LEXICAL COMPONENT - INITIATE LEX PROCESSING
   yyscan_t scanner;    
   YY_BUFFER_STATE  buffer;
   yylex_init_extra(&parseSupportStruct, &scanner );
   //yylex_init(&scanner);

   buffer = yy_scan_buffer(inputToBeParsed, i+2, scanner);

   if (buffer == NULL) {
       strcpy(errorStrings,"YY_BUFFER_STATE returned NULL pointer\n");
       return (-1);
   }


//BISON PART - THE ACTUAL PARSER
yyparse(scanner, &parseSupportStruct);

...

yylex_destroy(scanner);
...
}

My .l options are:

 %option noinput nounput noyywrap 8bit nodefault                                 
 %option yylineno
 %option reentrant bison-bridge bison-locations                                  
 %option extra-type="parseSupportStructType *"

Relevant lines from .y are:

  %define api.pure full
  %locations
  %param { yyscan_t scanner }
  %parse-param { parseSupportStructType* parseSupportStruct}
  %code {
    int yylex(YYSTYPE* yylvalp, YYLTYPE* yyllocp, yyscan_t scanner);
    void yyerror(YYLTYPE* yyllocp, yyscan_t unused, parseSupportStructType* parseSupportStruct,  const char* msg);
    char *yyget_text (yyscan_t);
    char *strcpy(char *, const char *);
  }
  %union {
     int numval;
     char *strval;
     double floatval; 
  }

In some rules of my parser, I try to access yyllocp->first_line. In the first call to parseMyString(...), I get the correct value. The second time, I get some uninitialized value. Do I need to initialize yyllocp->first_line in each call to parseMyString? How and where? I know I have given only partial, redacted code to explain the situation; I will be happy to provide further details.

Using valgrind, I have removed memory leaks to the best of my abilities, but some third-party library issues are beyond my control.


Solution

  • Nothing in flex or bison will maintain the value of yylloc.

    Bison parsers (other than push parsers) will initialise that variable. (If you accept the default location type -- that is, you don't #define YYLTYPE -- yylloc will be initialised to {1, 1, 1, 1}. Otherwise, it will be zero-initialised, whatever that means for whatever type it is.) Bison also produces code which computes the location of a non-terminal based on the locations of the non-terminal's first and last children. Flex's generated code doesn't touch the location object at all.
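    To make the default location computation concrete, here is a minimal sketch of the rule described above, written as a plain C function rather than as the macro that Bison generates. The Location type and the names initial_location and merge_locations are hypothetical stand-ins introduced for illustration; the layout of rhs (index 0 holds the location of the symbol just before the rule's right-hand side) is an assumption modelled on how the default computation is usually described.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Bison's default location type. */
typedef struct Location {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
} Location;

/* The starting value described above for the default location type. */
Location initial_location(void) {
    Location loc = {1, 1, 1, 1};
    return loc;
}

/* Compute the location of a non-terminal from its children.
   rhs[1..n] are the children; rhs[0] is the symbol preceding them
   (an assumption of this sketch). */
Location merge_locations(const Location *rhs, int n) {
    Location result;
    assert(rhs != NULL && n >= 0);
    if (n > 0) {
        /* Span from the start of the first child to the end of the last. */
        result.first_line = rhs[1].first_line;
        result.first_column = rhs[1].first_column;
        result.last_line = rhs[n].last_line;
        result.last_column = rhs[n].last_column;
    } else {
        /* Empty rule: a zero-width location at the end of the preceding symbol. */
        result.first_line = result.last_line = rhs[0].last_line;
        result.first_column = result.last_column = rhs[0].last_column;
    }
    return result;
}
```

    Note that nothing here reads the input: the result is only as good as the token locations the scanner supplied, which is why the scanner side has to be handled separately.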

    A flex scanner does automatically maintain yylineno if you enable this feature with

    %option yylineno
    

    Flex can usually do that more efficiently than you can, and it handles all the corner cases (yyless, yymore, input(), REJECT). So if you want to track line numbers, I strongly recommend letting flex do it.

    But there is one important issue with flex's yylineno support. In a reentrant scanner, the line number is stored in each flex buffer, not in the scanner state object. That's almost certainly the correct place to store it, IMHO, because if you are using multiple buffers, they probably represent multiple input streams, and normally you'll want to cite the number of a line within its file. But yy_scan_buffer does not initialise this field. (And therefore neither do yy_scan_string and yy_scan_bytes, which are just wrappers around yy_scan_buffer.)

    So if you are using one of the yy_scan_* interfaces, you should reset yylineno by calling yyset_lineno immediately after yy_scan_*. In your case, this would be:

    buffer = yy_scan_buffer(inputToBeParsed, i+2, scanner);
    yyset_lineno(1, scanner);
    

    Once you've got yylineno, it's easy to maintain the yylloc object. Flex has a hook which lets you inject code just before the action for any pattern is executed (even if the action is empty), and this hook can be used to automatically maintain yylloc. In another answer, I provide a simple example of this technique (which depends on yylineno being maintained by the flex-generated scanner):

    #define YY_USER_ACTION                                             \
      yylloc->first_line = yylloc->last_line;                          \
      yylloc->first_column = yylloc->last_column;                      \
      if (yylloc->last_line == yylineno)                               \
        yylloc->last_column += yyleng;                                 \
      else {                                                           \
        yylloc->last_line = yylineno;                                  \
        yylloc->last_column = yytext + yyleng - strrchr(yytext, '\n'); \
      }
    

    As the notes in that answer indicate, the above is not fully general, but it will work in many circumstances:

    This YY_USER_ACTION macro should work for any scanner which does not use yyless(), yymore(), input() or REJECT. Correctly coping with these features is not too difficult but it seemed out of scope here.
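    To see what the macro computes, the same arithmetic can be traced outside of flex. The following is a rough standalone simulation, not generated scanner code: Location, count_newlines, track_token and scan_token are hypothetical names, and the lineno argument plays the role of yylineno after the token has been matched. The assertion in the else branch is an addition of this sketch; the macro itself has no such check.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for Bison's default location type. */
typedef struct Location {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
} Location;

/* Stand-in for flex's yylineno bookkeeping: one per newline in the token. */
int count_newlines(const char *text, int leng) {
    int n = 0;
    for (int i = 0; i < leng; i++) {
        if (text[i] == '\n') n++;
    }
    return n;
}

/* Same arithmetic as the YY_USER_ACTION macro, as a function. */
void track_token(Location *loc, const char *text, int leng, int lineno) {
    /* The new token starts where the previous one ended. */
    loc->first_line = loc->last_line;
    loc->first_column = loc->last_column;
    if (loc->last_line == lineno) {
        loc->last_column += leng;
    } else {
        /* The line number changed, so the token should contain a newline.
           A stale line number breaks this assumption. */
        const char *newline = strrchr(text, '\n');
        assert(newline != NULL);
        loc->last_line = lineno;
        loc->last_column = (int)(text + leng - newline);
    }
}

/* Scan one token: bump the line counter, then update the location. */
void scan_token(Location *loc, int *lineno, const char *text) {
    int leng = (int)strlen(text);
    *lineno += count_newlines(text, leng);
    track_token(loc, text, leng, *lineno);
}
```

    Starting from a location of {1, 1, 1, 1} and a line number of 1, scanning "foo" leaves the location at line 1, columns 1 to 4 (the end column is one past the last character). If the line number were left over from a previous buffer instead, the very first token would take the else branch without containing a newline, which is one more reason to reset yylineno after each yy_scan_* call.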

    You cannot handle yyless(), yymore() or REJECT before the action (since before the action it's not possible to know if they will be executed), so a more robust location-tracker in an application which used those features would have to include code to fix yylloc:

    • For yyless(), the above code for setting last_line and last_column can be re-executed after the yyless() call, since the flex scanner will fix yyleng and yylineno.

    • For REJECT, it is not possible to insert code after REJECT. The only way to handle it is to keep a backup of yylloc and restore it immediately before the REJECT macro. (I strongly advise against using REJECT. It's extremely inefficient and can almost always be replaced with the combination of a call to yyless() and a start condition.)

    • For yymore(), yylloc is still correct, but the next action must not overwrite the token start position. Getting that right would probably require maintaining a flag to indicate whether or not yymore() had been called.

    • For input(), if you want the characters read to be considered part of the current token, you could advance the end location in yylloc after the call to input() (which requires distinguishing between input() returning a newline, an end-of-file indicator, or a regular character). Alternatively, if you want the characters read with input() to not be considered part of any token, you would need to abandon the idea of using the end position of the previous token as the start position of the current token, which would require keeping a separate position value to be used as the start position of the next token.
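    As an illustration of the first input() option, the three-way distinction could be sketched as a small helper that is called once per character returned by input(). This is only a sketch: Location and advance_end are hypothetical names, and treating any value less than or equal to zero as the end-of-file indicator is an assumption made here, since the exact value depends on the scanner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Bison's default location type. */
typedef struct Location {
    int first_line;
    int first_column;
    int last_line;
    int last_column;
} Location;

/* Extend the end of the current token over one character read with input().
   The start of the token is left untouched. */
void advance_end(Location *loc, int ch) {
    assert(loc != NULL);
    if (ch <= 0) {
        /* Assumed end-of-file indicator: nothing was consumed. */
        return;
    }
    if (ch == '\n') {
        /* Newline: the end moves to the first column of the next line. */
        loc->last_line += 1;
        loc->last_column = 1;
    } else {
        /* Regular character: the end moves one column to the right. */
        loc->last_column += 1;
    }
}
```

    In an action, the call would look something like advance_end(yylloc, c) immediately after c = input(). The column convention matches the macro above: after a newline, the end column is 1.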