compiler-construction lex

How to match the beginning of a line?


I am writing the cat(1) utility with lex. To implement the -n option (number every line), I ended up writing something like this:

^. {
printf("%8d  ", ++lino);
ECHO;
}

I know the end of a line (EOL) can be matched with the anchor $ or with \n, so I wonder whether there is something similar that matches only the beginning of a line (BOL), so that I don't have to use ECHO.


Solution

  • (I agree with the comment by Joachim Pileborg that lex is not the tool for implementing cat. The rest of this answer is in the spirit of explaining a bit about lex.)

    1. The provided lex program will not work if there are empty lines in the input, because ^. does not match an empty line. (In lex, . does not match a newline character.) So a reasonably minimal (f)lex input file would be:

      %option noyywrap noinput nounput
      %%
        int lino = 0;
      ^(.|\n)    { printf("%8d   %c", ++lino, *yytext); }
      

      Here, I just print the matched character in the printf, which is equivalent to using ECHO. So it does not really "eliminate" the ECHO. The rest of each line is still copied to the output by the default rule.

    2. (f)lex rules must match at least one character. So it wouldn't really be possible for a pattern to consist only of $, any more than it would be possible for a pattern to consist only of ^ (which is a BOL anchor). In that sense, the answer to your question is simply "no".

    3. A more easily-understood (and probably more efficient) solution is to actually match each line. This solution never uses ECHO, not even in the default rule, so I've told flex to not generate a default rule:

      %option noyywrap noinput nounput nodefault
      %%
        int lino = 0;
      .*\n?    { printf("%8d   %s", ++lino, yytext); }
      

      That's not quite perfect, because it will truncate lines which contain a NUL character. (That is, the printf will effectively truncate the line; the line will be parsed correctly.) To fix it, it's necessary to use fwrite instead of printf:

      %option noyywrap noinput nounput nodefault
      %%
        int lino = 0;
      .*\n?    { printf("%8d   ", ++lino);
                 fwrite(yytext, 1, yyleng, yyout); }
      

      The newline is made optional (\n?) in case the last line of the file is not terminated with a newline. Because (f)lex patterns never match zero characters, that rule is actually equivalent to the more precise but clunkier regular expression .*\n|.+.
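
For comparison, here is a rough plain-C sketch of what the last rule does, assuming the whole input is already in a memory buffer. The helper name number_lines is hypothetical (it is not part of lex or flex). It treats each "line" as either a run of bytes ending in a newline or a final run without one, mirroring .*\n|.+, and it writes the bytes with fwrite so that an embedded NUL does not truncate the output:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of lex/flex): writes the `len` bytes of `buf`
 * to `out`, prefixing every line with its number in the same "%8d   " format
 * as the flex rule. Returns the number of lines written. */
static int number_lines(const char *buf, size_t len, FILE *out)
{
    int lino = 0;
    size_t start = 0;

    assert(buf != NULL && out != NULL);

    while (start < len) {
        /* A line runs up to and including the next '\n', or to the end of
         * the input if the last line has no terminating newline. */
        const char *nl = memchr(buf + start, '\n', len - start);
        size_t end = nl ? (size_t)(nl - buf) + 1 : len;

        fprintf(out, "%8d   ", ++lino);
        /* fwrite takes an explicit length, so NUL bytes are copied as-is;
         * printf("%s") would stop at the first NUL. */
        fwrite(buf + start, 1, end - start, out);
        start = end;
    }
    return lino;
}
```

Calling number_lines(data, size, stdout) on a buffer read from a file should give the same kind of output as the flex version. As in the flex rule, an empty line (a lone newline) still gets a number, and empty input produces no output at all.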