I have the following fragment in my Bison file that describes a simple "while" loop as a condition followed by a sequence of statements. The list of statements is large and includes BREAK
and CONTINUE
. The latter two can be used only within a loop.
%start statements
%%
statements: | statement statements
statement: loop | BREAK | CONTINUE | WRITE | ...
loop: WHILE condition statements ENDWHILE
condition: ...
%%
I can add a C variable, set it upon entering the loop, reset it upon exiting, and check at BREAK
or CONTINUE
, but this solution does not look elegant:
loop: WHILE {loop++;} condition statements {loop--;} ENDWHILE
statement: loop | BREAK {if (!loop) yyerror();} ...
Is there a way to prevent the two statements from outside a loop using only Bison rules?
P.S. What I mean is "Is there an EASY way..," without fully duplicating the grammar.
Sure. You just need three different statement
non-terminals, one which matches all statements; one which matches everything but continue
(for switch
blocks), and one which matches everything but break
and continue
. Of course, this distinction needs to trickle down through your rules. You'll also need three versions of each type of compound statement: loops, conditionals, switch, braced blocks, and so on. Oh, and don't forget that statements can be labelled, so there are some more non-terminals to duplicate.
But yeah, it can certainly be done. The question is, is it worth going to all that trouble. Or, to put it another way, what do you get out of it?
To start with, the end user finds that where they used to have a pretty informative error message about continue
statements outside a loop, they now just get a generic Syntax Error. Now, you can fix that with some more grammar modifications, by actually providing productions which match the invalid statements, and then present a meaningful error message. But that's almost exactly the same code already rejected as inelegant.
Other than that, does it in any way reduce parser complexity? It lets you assume that a break statement is legally placed, but you still have to figure out where the break
statement's destination. And other than that, there's not really a lot of evident advantages, IMHO.
But if you want to do it, go for it.
Once you've done that, you could try modifying your grammar so that break
, continue
, goto
and return
cannot be followed by an unlabelled statement. That sounds like a good idea, and some languages do it. It can certainly be done in the grammar. (But before you get too enthusiastic, remember that some programmers do deliberately create dead code during debugging sessions, and they won't thank you for making it impossible.)
There is a BNF extension, used in the ECMAscript standard, amongst others, which parameterizes non-terminals with a list of features, each of which can be present or not. These parameters can then be used in productions, either as conditions or to be passed through to non-terminals on the right-hand side. This could be used to generate three versions of statement
, using the features [continue]
and [break]
, which would be used as gates on those respective statement syntaxes, and also passed through to the compound statement non-terminals.
I don't know of a parser generator capable of handling such parameterised rules, so I can't offer it as a concrete suggestion, but this question is one of the use cases which motivated parameterised non-terminals. (In fact, I believe it's one of the uses, but I might be remembering that wrong.)
With an ECMAScript-style formalism, this grammatical restriction could be written without duplicating rules. The duplication would still be there, under the surface, since the parser generator would have to macro expand the templated rules into their various boolean possibilities. But the grammar is a lot more readable and the size of the state machine is not so important these days.
I have no doubt that it would be a useful feature, but I also suspect that it would be overused, with the result that the quality of error messages would be reduced.
As a general rule, compilers should be optimised for correct inputs, with the additional goal of producing helpful error messages for invalid input. Complicating the grammar even a little to make easily described errors into syntax errors does not help with either of these goals. If it's possible to write a few lines of code to produce the correct error message for a detected problem, instead of emitting a generic syntax error, I would definitely do that.
There are (many) other use cases for the ECMAScript BNF extensions. For example they make it much easier to describe a syntax whose naive grammar requires two or three lookahead tokens.