I have a reentrant parser which takes its input from a string and uses a structure to maintain context. A function is called with different input strings to be parsed. The relevant code of that function is:
int parseMyString(char *inputToBeParsed) {
    //LEXICAL COMPONENT - INITIATE LEX PROCESSING
    yyscan_t scanner;
    YY_BUFFER_STATE buffer;
    yylex_init_extra(&parseSupportStruct, &scanner );
    //yylex_init(&scanner);
    buffer = yy_scan_buffer(inputToBeParsed, i+2, scanner);
    if (buffer == NULL) {
        strcpy(errorStrings,"YY_BUFFER_STATE returned NULL pointer\n");
        return (-1);
    }
    //BISON PART - THE ACTUAL PARSER
    yyparse(scanner, &parseSupportStruct);
    ...
    yylex_destroy(scanner);
    ...
}
My .l options are:
%option noinput nounput noyywrap 8bit nodefault
%option yylineno
%option reentrant bison-bridge bison-locations
%option extra-type="parseSupportStructType *"
Relevant lines from .y are:
%define api.pure full
%locations
%param { yyscan_t scanner }
%parse-param { parseSupportStructType* parseSupportStruct}
%code {
int yylex(YYSTYPE* yylvalp, YYLTYPE* yyllocp, yyscan_t scanner);
void yyerror(YYLTYPE* yyllocp, yyscan_t unused, parseSupportStructType* parseSupportStruct, const char* msg);
char *yyget_text (yyscan_t);
char *strcpy(char *, const char *);
}
%union {
int numval;
char *strval;
double floatval;
}
In my parser, in some rules, I try to access yyllocp->first_line. In the first call to parseMyString(...), I get the correct value. The second time, I get some uninitialized value. Do I need to initialize yyllocp->first_line in each call to parseMyString? How and where? I know I have given partial, redacted code, to explain the situation. Will be happy to provide further details.
Using valgrind, I have removed memory leaks to the best of my abilities, but some third-party library issues are beyond my control.
Nothing in flex or bison will maintain the value of yylloc.
Bison parsers (other than push parsers) will initialise that variable. (If you accept the default location type, that is, you don't #define YYLTYPE, yylloc will be initialised to {1, 1, 1, 1}. Otherwise, it will be zero-initialised, whatever that means for whatever type it is.) Bison also produces code which computes the location of a non-terminal based on the locations of the non-terminal's first and last children. Flex's generated code doesn't touch the location object at all.
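To make that last point concrete, here is a rough plain-C sketch of what the default location computation amounts to. The names Loc and span_of_children are made up for illustration; in a real parser the work is done by Bison's YYLLOC_DEFAULT macro operating on YYLTYPE values, and rhs[0] stands for the location of the symbol just before the rule's right-hand side.

```c
#include <assert.h>

/* Same four fields as Bison's default YYLTYPE; the name Loc is hypothetical. */
typedef struct Loc {
    int first_line, first_column;
    int last_line, last_column;
} Loc;

/* Hypothetical helper mirroring the behaviour of the default YYLLOC_DEFAULT.
 * rhs[1..n] are the locations of the rule's children; rhs[0] is the location
 * of the symbol preceding the right-hand side (only used for empty rules). */
static Loc span_of_children(const Loc *rhs, int n)
{
    Loc cur;
    assert(n >= 0);
    if (n > 0) {
        /* Start of the first child, end of the last child. */
        cur.first_line   = rhs[1].first_line;
        cur.first_column = rhs[1].first_column;
        cur.last_line    = rhs[n].last_line;
        cur.last_column  = rhs[n].last_column;
    } else {
        /* Empty rule: an empty span located at the end of the preceding symbol. */
        cur.first_line   = cur.last_line   = rhs[0].last_line;
        cur.first_column = cur.last_column = rhs[0].last_column;
    }
    return cur;
}
```

Note that nothing here invents positions: if the scanner never fills in the token locations, the non-terminal locations are just combinations of whatever garbage the tokens carried.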
A flex scanner does automatically maintain yylineno if you enable this feature with
%option yylineno
Flex can usually do that more efficiently than you can, and it handles all the corner cases (yyless, yymore, input(), REJECT). So if you want to track line numbers, I strongly recommend letting flex do it.
But there is one important issue with flex's yylineno support. In a reentrant scanner, the line number is stored in each flex buffer, not in the scanner state object. That's almost certainly the correct place to store it, IMHO, because if you are using multiple buffers, they probably represent multiple input streams, and normally you'll want to cite the number of a line within its file. But yy_scan_buffer does not initialise this field. (And therefore neither do yy_scan_string and yy_scan_bytes, which are just wrappers around yy_scan_buffer.)
So if you are using one of the yy_scan_* interfaces, you should reset yylineno by calling yyset_lineno immediately after yy_scan_*. In your case, this would be:
buffer = yy_scan_buffer(inputToBeParsed, i+2, scanner);
yyset_lineno(1, scanner);
Once you've got yylineno, it's easy to maintain the yylloc object. Flex has a hook which lets you inject code just before the action for any pattern is executed (even if the action is empty), and this hook can be used to automatically maintain yylloc. In another answer, I provide a simple example of this technique (which depends on yylineno being maintained by the flex-generated scanner):
#define YY_USER_ACTION \
    yylloc->first_line = yylloc->last_line; \
    yylloc->first_column = yylloc->last_column; \
    if (yylloc->last_line == yylineno) \
        yylloc->last_column += yyleng; \
    else { \
        yylloc->last_line = yylineno; \
        yylloc->last_column = yytext + yyleng - strrchr(yytext, '\n'); \
    }
As the notes in that answer indicate, the above is not fully general, but it will work in many circumstances:
This YY_USER_ACTION macro should work for any scanner which does not use yyless(), yymore(), input() or REJECT. Correctly coping with these features is not too difficult but it seemed out of scope here.
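If it helps to see the macro's arithmetic in isolation, here is a small plain-C stand-in that can be compiled without flex. It is only a sketch: Loc and advance_location are made-up names, and the text, len and lineno parameters play the roles of yytext, yyleng and yylineno as they would be when the hook runs. It assumes, as the macro does, that the text contains a newline whenever lineno has moved past the previous end line.

```c
#include <string.h>

/* Same four fields as Bison's default YYLTYPE; the name Loc is hypothetical. */
typedef struct Loc {
    int first_line, first_column;
    int last_line, last_column;
} Loc;

/* Hypothetical stand-in for the YY_USER_ACTION macro shown earlier. */
static void advance_location(Loc *loc, const char *text, int len, int lineno)
{
    /* The new token starts where the previous one ended. */
    loc->first_line = loc->last_line;
    loc->first_column = loc->last_column;
    if (loc->last_line == lineno) {
        /* No newline in the token: just move the end column forward. */
        loc->last_column += len;
    } else {
        /* The token crossed a newline: the end column is the number of
         * characters from the last newline to the end of the token. */
        loc->last_line = lineno;
        loc->last_column = (int)(text + len - strrchr(text, '\n'));
    }
}
```

Starting from {1, 1, 1, 1}, a three-character token on line 1 moves the end to column 4, and a following token consisting of a newline plus two spaces moves the end to line 2, column 3.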
You cannot handle yyless(), yymore() or REJECT before the action (since before the action it's not possible to know if they will be executed), so a more robust location-tracker in an application which used those features would have to include code to fix yylloc:
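For yyless(), the above code for setting last_line and last_column can be re-executed after the yyless() call (after first resetting last_line and last_column to first_line and first_column, so that the end is recomputed from the token's start), since the flex scanner will fix yyleng and yylineno.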
For REJECT, it is not possible to insert code after REJECT. The only way to handle it is to keep a backup of yylloc and restore it immediately before the REJECT macro. (I strongly advise against using REJECT. It's extremely inefficient and can almost always be replaced with the combination of a call to yyless() and a start condition.)
For yymore(), yylloc is still correct, but the next action must not overwrite the token start position. Getting that right would probably require maintaining a flag to indicate whether or not yymore() had been called.
For input(), if you want the characters read to be considered part of the current token, you could advance the end location in yylloc after the call to input() (which requires distinguishing between input() returning a newline, an end-of-file indicator, or a regular character). Alternatively, if you want the characters read with input() to not be considered part of any token, you would need to abandon the idea of using the end position of the previous token as the start position of the current token, which would require keeping a separate position value to be used as the start position of the next token.
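As one possible shape for the yymore() flag mentioned above, here is a plain-C sketch that can be compiled without flex. Everything in it is an assumption for illustration: the names Loc, Tracker, keep_start and advance_location are made up, and in a real scanner the flag would live somewhere like the extra data and be set by the action that calls yymore(). The sketch relies on yytext and yyleng covering the whole accumulated token after yymore(), so the end position is recomputed from the token's start rather than added to the previous end.

```c
#include <string.h>

/* Same four fields as Bison's default YYLTYPE; the name Loc is hypothetical. */
typedef struct Loc {
    int first_line, first_column;
    int last_line, last_column;
} Loc;

/* Hypothetical tracker: the location plus a flag which an action would set
 * to 1 at the point where it calls yymore(). */
typedef struct Tracker {
    Loc loc;
    int keep_start;
} Tracker;

/* Stand-in for a yymore()-aware YY_USER_ACTION. text/len/lineno play the
 * roles of yytext/yyleng/yylineno when the hook runs. */
static void advance_location(Tracker *t, const char *text, int len, int lineno)
{
    Loc *loc = &t->loc;
    if (!t->keep_start) {
        /* Normal case: the token starts where the previous one ended. */
        loc->first_line = loc->last_line;
        loc->first_column = loc->last_column;
    }
    t->keep_start = 0;  /* the flag applies to one match only */

    /* Recompute the end from the start, using the full token text. */
    if (loc->first_line == lineno) {
        loc->last_line = loc->first_line;
        loc->last_column = loc->first_column + len;
    } else {
        loc->last_line = lineno;
        loc->last_column = (int)(text + len - strrchr(text, '\n'));
    }
}
```

Because the end is always derived from the start position, the same function could also be re-run after yyless() has shortened the token, although I haven't tested that against a real scanner.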