Tags: bison, flex-lexer, yacc, lex

How to process NUL characters in a Flex lexer, given that 0 denotes end-of-file?


My input file consists of one byte, the NUL character (hex 0).

I have a Flex rule that matches the NUL character and the action returns it:

\0               { return(yytext[0]); }

Below is my complete Flex file. When I run it, I get no output. I conclude that the lexer is interpreting the value returned from my rule as the end-of-file signal. Is that right? If so, how can I process NUL characters in a Flex lexer?

%option noyywrap
%% 
\0               { return(yytext[0]); }
%%
int main(int argc, char *argv[])
{ 
      yyin = fopen(argv[1], "r");
      int token = yylex();
      while ( token != 0 ) {
        switch(token) {
           default:
              printf("TOKEN: %c\n", yytext[0]);
          }
        token = yylex();
      } 
      fclose(yyin);
      return 0;
}

Solution

  • You're free to recognise NULs in your input stream as you see fit. But you cannot use 0 as a token number, because when yylex returns 0 that will be interpreted as meaning end of input by its caller (typically yyparse, but in this case your own main() program).

    I'm a bit puzzled by your statement:

    "I conclude that the lexer is interpreting the value returned from my rule as the end-of-file signal."

    The lexer doesn't interpret the value returned from your rule at all. Your rules are part of the yylex() function, and when your rule executes return X, X is returned from yylex. There is no inner function which is called.

    So it's not yylex which is interpreting the value returned from your rule as the end-of-file signal. That interpretation happens in exactly one place, the fifth line of your main() function:

        while ( token != 0 ) {
    

    Since you're not using a parser generated by bison/yacc, you're actually free to use whatever integer you like as an end-of-file return from yylex(). But you need to be aware that the generated yylex will return 0 from its default <<EOF>> rules; if you want 0 to mean something other than end of file, you'll need to add explicit <<EOF>> rules in every start condition which return what you chose to use. It's almost always simpler to stick with the standard 0, which means that you can't use it as a token number.

    So in order to handle a NUL as a single-byte token, you'll need to choose some integer other than 0 to represent that token, and thus you cannot use return yytext[0]; if yytext[0] might be a NUL.

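    %option noyywrap
    %{
    #include <stdio.h>
    /* Must be nonzero: 0 is reserved for end of input. */
    #define NULL_TOKEN 257
    %}
    %%
    \0               { return NULL_TOKEN; }
    %%
    int main(int argc, char *argv[])
    {
          /* Open in binary mode so a NUL byte reaches the lexer unchanged. */
          if (argc < 2 || (yyin = fopen(argv[1], "rb")) == NULL) {
              fprintf(stderr, "cannot open input file\n");
              return 1;
          }
          int token = yylex();
          while ( token != 0 ) {      /* 0 still means end of input */
            switch(token) {
               default:
                  printf("TOKEN: %d\n", token);
              }
            token = yylex();
          }
          fclose(yyin);
          return 0;
    }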