Consider this simple lex/yacc definition:
In .l:
PRINT { return PRINT;}
In .y:
statement:
PRINT printlist
{
statement_t *new = mkstatement(PRINT);
new->parms.print.using = NULL;
new->parms.print.l = $2;
$$ = new;
}
printlist:
expression
{
printitem_t *new = malloc(sizeof(*new));
new->e = $1;
new->sep = 0;
$$ = g_list_prepend(NULL, new);
} | { strings of expressions }
Simple enough. Now I would like to see and store comments. The classic solution is a simple lex rule:
"//".*\n
This tokenizes the entire comment as a single token on the yacc side. Now I can extract the actual comment using string processing, but that's kinda what lex/yacc is for. So am I missing a simple way to parse a REM like a PRINT, that is, is there an easy way to get "everything else on the line" as $2? I've tried several things, but invariably that causes the lex side to match every line against it because it's the longest match.
(F)lex does not provide any mechanism similar to "captures" in regex libraries. yytext is always the complete token recognised by the (f)lex pattern.
Sometimes you can just use fixed offsets to extract the interesting portion of a token. For example, you might see this sort of (f)lex action, which removes the quotation marks from a string literal (simplified; a real parser would probably care about backslash escapes):
["][^"]*["] { yylval.str = strndup(yytext + 1, yyleng - 2); }
That would certainly work for your comment case, which I would write without the terminating newline character (in part because it's possible that there isn't one, and in part because newlines are almost certainly handled elsewhere in the scanner, and there may be some action associated with a newline):
"//".* { yylval.str = strndup(yytext + 2, yyleng - 2); return TOK_STRING; }
Perhaps you'd prefer to remove the leading whitespace (if any) in the comment before passing through the text. Before discussing that, let me suggest that you might actually want to leave the whitespace in place. Perhaps the comment contains a nicely-indented code sample, whose formatting will be destroyed by removing leading whitespace from all the comment bodies.
But if you really wanted to remove the whitespace, then you would have two possibilities:
You could rescan the token, looking for the first non-whitespace character after the prefix, and then copy the rest of the token into yylval. I guess that's precisely what you mean by "using string processing", and I can see why you might think it's ugly. (Although in this case it has the virtue of simplicity.)
You can use a start condition to put the scanner in a different lexical context, and then use normal (f)lex patterns to identify the interesting part of the comment token:
%x SC_COMMENT
%%
"//"[[:blank:]]* { BEGIN(SC_COMMENT); }
<SC_COMMENT>.* { yylval.str = strdup(yytext); return TOK_COMMENT; }
<SC_COMMENT>\n { BEGIN(INITIAL); }
That associates the comment text with the comment token itself, avoiding the need to use an additional parser rule. But if for some reason you really wanted to write a redundant parser rule, you could easily modify the above to produce two tokens:
%x SC_COMMENT
%%
"//"[[:blank:]]* { BEGIN(SC_COMMENT); return TOK_COMMENT; }
<SC_COMMENT>.* { yylval.str = strdup(yytext); return TOK_COMMENT_BODY; }
<SC_COMMENT>\n { BEGIN(INITIAL); }
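With the two-token variant, the matching parser rule would consume both tokens, much like the PRINT rule in the question. A sketch (store_comment is a hypothetical helper; adjust the semantic-value types to your %union):

```
comment:
  TOK_COMMENT TOK_COMMENT_BODY
    {
      /* $2 is the strdup'ed body set by the scanner action;
         store_comment is a hypothetical helper, not part of yacc. */
      $$ = store_comment($2);
    }
  ;
```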