Using flex, how can I keep yytext contents when EOF is reached and input is provided via YY_INPUT?

When a scanner generated by flex encounters end-of-file, it loses the content of yytext[] that was left there by yymore() calls in previous rules. This erroneous behavior only happens if YY_INPUT() is redefined.

This may be a bug in flex, but it seems likely that there's something missing -- something else that a flex scanner definition should provide when it redefines YY_INPUT().

I have tested using flex 2.5.35 on both Ubuntu 12.04.1 and on Windows 7. On both systems, the scanner loses the yytext[] content at EOF if the text to be scanned is provided via an explicit definition of YY_INPUT().

Below is a sample flex scanner (flex-test.l) that is intended to read and print HTML comments, even if the last comment is unterminated. It works correctly when its input is provided via yy_scan_string(), but fails when its input is instead provided by an explicit definition of YY_INPUT(). In the sample code, #if's are used to select between the yy_scan_string() and the YY_INPUT() implementations. Specifically, the expected output:

Begin comment: <!--
More comment:  <!--incomplete
EOF comment:   <!--incomplete

appears if the scanner is built using

flex --nounistd flex-test.l && gcc -DREDEFINE_YY_INPUT=0 lex.yy.c

But if the scanner is built using

flex --nounistd flex-test.l && gcc -DREDEFINE_YY_INPUT=1 lex.yy.c

(changing the =0 to =1), then this incorrect output appears:

Begin comment: <!--
More comment:  <!--incomplete
EOF comment:

Notice the absence of any comment text in that last line of output.

Here is the sample code:

/* A scanner demonstrating bad interaction between yymore() and <<EOF>>
 * when YY_INPUT() is redefined: specifically, yytext[] content is lost. */

%{
#include <stdio.h>
#include <string.h>  /* strlen() and memcpy(), used by set_data() and get_data() */

int yywrap(void) { return 1; }

#if REDEFINE_YY_INPUT

  #define MIN(a,b) ((a)<(b) ? (a) : (b))

  const char *source_chars;
  size_t source_length;

  #define set_data(s) (source_chars=(s), source_length=strlen(source_chars))

  size_t get_data(char *buf, size_t request_size) {
    size_t copy_size = MIN(request_size, source_length);
    memcpy(buf, source_chars, copy_size);
    source_chars += copy_size;
    source_length -= copy_size;
    return copy_size;
  }

  #define YY_INPUT(buf,actual,ask) ((actual)=get_data(buf,ask))

#endif

%}

%x COMM

%%

"<!--"          printf("Begin comment: %s\n", yytext); yymore(); BEGIN(COMM);
<COMM>[^-]+     printf("More comment:  %s\n", yytext); yymore();
<COMM>.         printf("More comment:  %s\n", yytext); yymore();
<COMM>--+\ *[>] printf("End comment:   %s\n", yytext); BEGIN(INITIAL);
<COMM><<EOF>>   printf("EOF comment:   %s\n", yytext); BEGIN(INITIAL); return 0;

.               printf("Other:         %s\n", yytext);

<<EOF>>         printf("EOF:           %s\n", yytext); return 0;
%%

int main(void) {
  const char *text = "<!--incomplete";

  #if REDEFINE_YY_INPUT
    set_data(text);
    yylex();
  #else
    YY_BUFFER_STATE state = yy_scan_string(text);
    yylex();
    yy_delete_buffer(state);
  #endif

  return 0;
}

Solution

This problem has been in flex since, probably, forever. Basically, when flex gets an EOF from its current buffer, it processes the last token and then reinitializes the buffer, which effectively throws the current token away, even if it had been "saved" with yymore(). (It actually initializes the first two characters with NULs, but that's sufficient to destroy it.) It then calls yywrap(), which has the option of providing another buffer (file).

This behaviour is usually harmless, because usually tokens are not allowed to span two different input files, but it would sometimes be nice to have the option. Not nice enough that anyone has bothered to fix it in the quarter century of flex's existence, though.

The unfortunate consequence is that you can't use yytext after you get an EOF, since the buffer reset has already been done even though there are no more input files. (yyleng isn't right, either; it has not been reset yet, and it has also been incremented with the NUL which triggered the EOF.)

There's a hack in the implementation of yy_scan_string which sets the newly-created buffer's yy_fill_buffer flag to 0, meaning that no attempt should be made to refill the buffer. That prevents the buffer reset, but does not protect yyleng, which is still incorrect. I'd consider the preservation of yytext pure luck, though.

If flex were being actively maintained, I'd suggest adding to the flex manual a comment that yytext and yyleng are undefined in an <<EOF>> rule, and maybe even think about fixing yymore() so that it allows spanning input buffers (or documenting the fact that it doesn't).

In short, you have two options:

1) Just use yy_scan_string or yy_scan_buffer (with all appropriate caveats), and hope that no-one reverts the hack which allows you to look at yytext in the <<EOF>> rule. I don't know how future-proof that hope is, but there's nothing forcing you to upgrade.

But you're probably better off with:

2) Use your own buffer to keep the accumulated token string.

Option (2) is not actually all that expensive; if the token strings are large, it is probably better than using yymore because flex really wasn't designed to handle large tokens. For comments, which can be quite sizable, you'll probably find that maintaining your own buffer will be a lot faster, as well as being a lot more predictable.
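
As a rough illustration of option (2), here is a minimal sketch of a growable accumulation buffer that could go in the %{ ... %} prologue of the scanner. The tokbuf type, the tokbuf_* functions, and comment_buf are all hypothetical names invented for this example; none of them are part of flex. The idea is simply that each rule appends its own match to the buffer instead of calling yymore(), so nothing depends on the state of yytext at EOF.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token accumulator; these names are NOT part of flex. */
typedef struct {
    char  *data;  /* NUL-terminated accumulated text, or NULL if never used */
    size_t len;   /* bytes stored, not counting the terminator */
    size_t cap;   /* bytes allocated */
} tokbuf;

/* One accumulator for the comment currently being scanned. */
static tokbuf comment_buf = { NULL, 0, 0 };

/* Forget the accumulated text but keep the allocation for reuse. */
static void tokbuf_reset(tokbuf *b) {
    assert(b != NULL);
    b->len = 0;
    if (b->data != NULL)
        b->data[0] = '\0';
}

/* Append n bytes of s. Returns 0 on success, -1 if allocation fails
 * (in which case the previous content is left intact). */
static int tokbuf_append(tokbuf *b, const char *s, size_t n) {
    assert(b != NULL);
    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n + 1)
            cap *= 2;                      /* grow geometrically */
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Accumulated text as a C string; never returns NULL. */
static const char *tokbuf_str(const tokbuf *b) {
    assert(b != NULL);
    return b->data ? b->data : "";
}
```

Assuming the rules from the question, they might then be rewritten along these lines, with every yymore() call replaced by an append (error handling for a failed append is omitted here):

"<!--"          tokbuf_reset(&comment_buf); tokbuf_append(&comment_buf, yytext, (size_t)yyleng); BEGIN(COMM);
<COMM>[^-]+     tokbuf_append(&comment_buf, yytext, (size_t)yyleng);
<COMM>.         tokbuf_append(&comment_buf, yytext, (size_t)yyleng);
<COMM>--+\ *[>] tokbuf_append(&comment_buf, yytext, (size_t)yyleng); printf("End comment:   %s\n", tokbuf_str(&comment_buf)); BEGIN(INITIAL);
<COMM><<EOF>>   printf("EOF comment:   %s\n", tokbuf_str(&comment_buf)); BEGIN(INITIAL); return 0;

Because the <<EOF>> rule reads only comment_buf and never touches yytext or yyleng, it should report the unterminated comment text whether the input arrives through yy_scan_string() or through a redefined YY_INPUT().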