Given the following grammar:
comment "/*" "*/" ;
TInt. Type1 ::= "int" ;
TBool. Type1 ::= "bool" ;
coercions Type 1 ;
BTrue. BExp ::= "true" ;
BFalse. BExp ::= "false" ;
EOr. Exp ::= Exp "||" Exp1 ;
EAnd. Exp1 ::= Exp1 "&&" Exp2 ;
EEq. Exp2 ::= Exp2 "==" Exp3 ;
ENeq. Exp2 ::= Exp2 "!=" Exp3 ;
ELt. Exp3 ::= Exp3 "<" Exp4 ;
EGt. Exp3 ::= Exp3 ">" Exp4 ;
ELte. Exp3 ::= Exp3 "<=" Exp4 ;
EGte. Exp3 ::= Exp3 ">=" Exp4 ;
EAdd. Exp4 ::= Exp4 "+" Exp5 ;
ESub. Exp4 ::= Exp4 "-" Exp5 ;
EMul. Exp5 ::= Exp5 "*" Exp6 ;
EDiv. Exp5 ::= Exp5 "/" Exp6 ;
EMod. Exp5 ::= Exp5 "%" Exp6 ;
ENot. Exp6 ::= "!" Exp ;
EVar. Exp8 ::= Ident ;
EInt. Exp8 ::= Integer ;
EBool. Exp8 ::= BExp ;
EIver. Exp8 ::= "[" Exp "]" ;
coercions Exp 8 ;
Decl. Decl ::= Ident ":" Type ;
terminator Decl ";" ;
LIdent. Lvalue ::= Ident ;
SBlock. Stm ::= "{" [Decl] [Stm] "}" ;
SExp. Stm ::= Exp ";" ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SReturn. Stm ::= "return" Exp ";" ;
SAssign. Stm ::= Lvalue "=" Exp ";" ;
SPrint. Stm ::= "print" Exp ";" ;
SIf. Stm ::= "if" "(" Exp ")" "then" Stm "endif" ;
SIfElse. Stm ::= "if" "(" Exp ")" "then" Stm "else" Stm "endif" ;
terminator Stm "" ;
entrypoints Stm;
the parser generated by BNFC fails to parse
{ c = a; }
although it parses
c = a;
or
{ print a; c = a; }
I suspect the problem is that the parser sees an Ident and doesn't know whether it starts a declaration or a statement (LR stuff etc.), although I would have thought one token of lookahead should be enough. However, I couldn't find any note in the BNFC documentation saying that it doesn't work for all grammars.
Any ideas how to get this working?
I would think you would get a shift/reduce conflict report for that grammar, although where that error message shows up might well depend on which tool BNFC is using to generate the parser. As far as I know, all the backend tools have the same approach to dealing with shift/reduce conflicts, which is to (1) warn the user about the conflict, and then (2) resolve the conflict in favour of shifting.
The problematic production is this one (I've left out type annotations to reduce clutter):
Stm ::= "{" [Decl] [Stm] "}" ;
Here, [Decl] and [Stm] are macros, which automatically produce definitions for the non-terminals with those names (or something equivalent which will be accepted by the backend tool). Specifically, the automatically-produced productions are:
[Decl] ::= /* empty */
| Decl ';' [Decl]
[Stm] ::= /* empty */
| Stm [Stm]
(The ; in the first rule is the result of a "terminator" declaration. I don't know why BNFC generates right-recursive rules, but that's how I interpret the reference manual, after a very quick glance, and I'm sure they have their reasons. For the purpose of this problem, it doesn't matter.)
What's important is that both Decl and Stm can start with an Ident. So let's suppose we're parsing { id ..., which might be { id : ... or { id = ..., but we've only read the { and the lookahead token is id. So there are two possibilities:
1. id is the start of a Decl. We should shift the Ident and go to the state which includes Decl → Ident • ':' Type
2. id is the start of a Stm. In this case, we need to reduce the production [Decl] → • before we shift Ident into a Stm production.
So we have a shift/reduce conflict, because we cannot see the second-next token (either : or =). And, as mentioned above, shift usually wins in this case, so the LR(1) parser will commit itself to expecting a Decl. Consequently, { a = b ; } will fail.
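The conflict can be isolated in a much smaller grammar. The following is only a sketch: the category names Block, D and S and all the labels are made up for illustration, and the two element rules are cut down to the minimum needed so that both start with an Ident. Running it through BNFC and the backend parser generator should report the same kind of shift/reduce conflict as the full grammar.

```lbnf
-- Hypothetical minimal reproduction; none of these names come from the
-- original grammar.
Blk.  Block ::= "{" [D] [S] "}" ;

-- Both element categories begin with Ident, and both lists may be empty.
DVar. D ::= Ident ":" Ident ;
SAsg. S ::= Ident "=" Ident ";" ;

terminator D ";" ;
terminator S "" ;

entrypoints Block ;
```

With this grammar, an input such as { x = y ; } exercises the state discussed above: after the { the parser has to choose between shifting x as the start of a D and reducing the empty [D] list, with only x visible as lookahead.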
An LR(2) parser generator would do fine with this grammar, but those are much harder to find. (Modern bison can produce GLR parsers, which are even more powerful than LR(2) at the cost of a bit of extra compute time, but not the version required by the BNFC tool.)
That leaves a few ways to get this working:
Allow declarations to be intermingled with statements. This one is my preference. It is simple, and many programmers expect to be able to declare a variable at first use rather than at the beginning of the enclosing block.
Make the declaration recognizable from the first token, either by putting the type first (as in C) or by adding a keyword such as var (as in JavaScript).
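As a sketch of the keyword variant (the "var" keyword is an assumption here, chosen only because it is the example named above), the declaration rule could become:

```lbnf
-- Sketch only: "var" is an assumed keyword, not part of the original grammar.
Decl.   Decl ::= "var" Ident ":" Type ;
terminator Decl ";" ;

-- The block rule can stay as it was, with its separate declaration section.
SBlock. Stm ::= "{" [Decl] [Stm] "}" ;
```

With this change a block would be written { var c : int; c = a; }. After the {, a lookahead of var means a declaration follows, and anything else means the declaration list is finished, so one token of lookahead should be enough.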
Modify the grammar to defer the lookahead. It is always possible to find an LR(1) grammar for any LR(k) language (provided k is finite), but it can be tedious. An ugly but effective alternative is to continue the lexical scan until either a : or some other non-whitespace character is found, so that id : gets tokenized as IdentDefine or some such. (This is the solution used by bison, as it happens. It means that you can't put comments between an identifier and the following :, but there are few, if any, good reasons to put a comment in that context.)
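Returning to the first option, intermingling declarations with statements, a minimal sketch might look like the following. The label SDecl is hypothetical; the other rules are taken from the grammar in the question, and the `terminator Decl ";" ;` line would be dropped because [Decl] is no longer used.

```lbnf
-- Sketch only: SDecl is a hypothetical label.
Decl.   Decl ::= Ident ":" Type ;

-- A declaration is now just another kind of statement.
SDecl.  Stm  ::= Decl ";" ;

-- The block no longer has a separate declaration section.
SBlock. Stm  ::= "{" [Stm] "}" ;
terminator Stm "" ;
```

Here the parser can always shift the Ident first and decide afterwards: a following : continues the Decl, a following = means the Ident was an Lvalue, and anything else means it was an expression. That is the same one-token decision it already makes when parsing c = a; at the top level.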