In Crafting Interpreters, we implement a little programming language using a recursive descent parser. Among many other things, it has these statements:
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" statement ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
One of the exercises is to add a break
statement to the language. Also, it should be a syntax error to have this statement outside a loop. Naturally, it can appear inside other blocks, if
statements etc. if those are inside a loop.
My first approach was to create a new rule, whileBody
, to accept break
:
## FIRST TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break ;
break → "break" ";" ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
But we have to accept break
inside nested loops, if
conditionals etc. What I could imagine is, I'd need a new rule for blocks and conditionals which accept break
:
## SECOND TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break
| whileBlock
| whileIfStmt
whileBlock→ "{" (declaration | break)* "}" ;
whileIfStmt → "if" "(" expression ")" whileBody ( "else" whileBody )? ;
break → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
It is not infeasible for now, but it can be cumbersome to handle it once the language grows. It is boring and error-prone to write even today!
I looked for inspiration in C and Java BNF specifications. Apparently, none of those specifications prohibit the break
outside loop. I guess their parsers have ad hoc code to prevent that. So, I followed suit and added code into the parser to prevent break
outside loops.
My questions are:
break
statement that only appears inside loops?break
command inside a syntax specification?
- Would the approach of my second try even work? In other words, could a recursive descent parser handle a
break
statement that only appears inside loops?
Sure. But you need a lot of duplication. Since while
is not the only loop construct, I've used a different way of describing the alternatives, which consists of adding _B
to the name of non-terminals which might include break
statements.
declaration → varDecl
| statement
declaration_B → varDecl
| statement_B
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block
statement_B → exprStmt
| printStmt
| whileStmt
| breakStmt
| ifStmt_B
| block_B
breakStmt → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )?
ifStmt_B → "if" "(" expression ")" statement_B ( "else" statement_B )?
whileStmt → "while" "(" expression ")" statement_B ;
block → "{" declaration* "}"
block_B → "{" declaration_B* "}"
Not all statement types need to be duplicated. Non-compound statement like exprStmt
don't, because they cannot possibly include a break
statement (or any other statement type). And the statement
which is the target of a loop statement like whileStmt
can always include break
, regardless of whether the while
was inside a loop or not.
- Is there a more practical way to bake the break command inside a syntax specification?
Not unless your syntax specification has marker macros, like the specification used to describe ECMAScript.
- Is there a different way to do this?
Since this is a top-down (recursive descent) parser, it's pretty straight-forward to handle this condition in the parser's execution. You just need to add an argument to every (or many) parsing functions which specifies whether a break is possible or not. Any parsing function called by whileStmt
would set that argument to True
(or an enumeration indicating that break is possible), while other statement types would just pass the parameter through, and the top-level parsing function would set the argument to False
. The breakStmt
implementation would just return failure if it is called with False
.