parsing compiler-construction grammar shift-reduce-conflict cup

reduce/reduce conflict in CUP

I am implementing a parser for a subset of Java using Java CUP.

The grammar is like

vardecl ::= type ID
type    ::= ID | INT | FLOAT | ...
exp     ::= ID | exp LBRACKET exp RBRACKET | ...
stmt    ::= ID ASSIGN exp SEMI

This works fine, but when I add

stmt ::= ID ASSIGN exp SEMI
        |ID LBRACKET exp RBRACKET ASSIGN exp SEMI

CUP won't work, the warnings are:

Warning : *** Shift/Reduce conflict found in state #122
  between exp ::= identifier (*) 
  and     statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI 
  under symbol LBRACKET
  Resolved in favor of shifting.

Warning : *** Reduce/Reduce conflict found in state #42
  between type ::= identifier (*) 
  and     exp ::= identifier (*) 
  under symbols: {}
  Resolved in favor of the first production.

Warning : *** Shift/Reduce conflict found in state #42
  between type ::= identifier (*) 
  and     statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI 
  under symbol LBRACKET
  Resolved in favor of shifting.

Warning : *** Shift/Reduce conflict found in state #42
  between exp ::= identifier (*) 
  and     statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI 
  under symbol LBRACKET
  Resolved in favor of shifting.

I think there are two problems:
1. type ::= ID and exp ::= ID, when the parser sees an ID, it wants to reduce it, but it doesn't know which to reduce, type or exp.

stmt ::= ID LBRACKET exp RBRACKET ASSIGN exp SEMI is for assignment of an element in array, such as arr[key] = value;
exp :: exp LBRACKET exp RBRACKET is for expression of get an element from array, such as arr[key]

So in the case arr[key], when the parser sees arr, it knows that it is an ID, but it doesn't know if it should shift or reduce to exp.

However, I have no idea of how to fix this, please give me some advice if you have, thanks a lot.

Solution

Your analysis is correct. The grammar is LR(2) because declarations cannot be identified until the ] token is seen, which will be the second-next token from the ID which could be a type.

One simple solution is to hack the lexer to return [] as a single token when the brackets appear as consecutive tokens. (The lexer should probably allow whitespace between the brackets, too, so it's not quite trivial but it's not complicated.) If a [ is not immediately followed by a ], the lexer will return it as an ordinary [. That makes it easy for the parser to distinguish between assignment to an array (which will have a [ token) and declaration of an array (which will have a [] token).

It's also possible to rewrite the grammar, but that's a real nuisance.

The second problem -- array indexing assignment versus array indexed expressions. Normally programming languages allow assignment of the form:

exp [ exp ] = exp

and not just ID [ exp ]. Making this change will delay the need to reduce until late enough for the parser to identify the correct reduction. Depending on the language, it's possible that this syntax is not semantically meaningful but checking that is in the realm of type checking (semantics) not syntax. If there is some syntax of that form which is meaningful, however, there is no obvious reason to prohibit it.

Some parser generators implement GLR parsers. A GLR parser would have no problem with this grammar because it is no ambiguous. But CUP isn't such a generator.