I'm building a grammar in bison, and I've narrowed down my last reduce/reduce error to the following test-case:
%{
#include <stdio.h>
#include <string.h>
extern int yydebug;
int yyparse(void);
void yyerror(const char *str)
{
    fprintf(stderr, "Error: %s\n", str);
}
int main(void)
{
    yydebug = 1;
    return yyparse();
}
%}
%right '='
%precedence CAST
%left '('
%token AUTO BOOL BYTE DOUBLE FLOAT INT LONG SHORT SIGNED STRING UNSIGNED VOID
%token IDENTIFIER
%start file
%debug
%%
file
: %empty
| statement file
;
statement
: expression ';'
;
expression
: expression '=' expression
| '(' type ')' expression %prec CAST
| '(' expression ')'
| IDENTIFIER
;
type
: VOID
| AUTO
| BOOL
| BYTE
| SHORT
| INT
| LONG
| FLOAT
| DOUBLE
| SIGNED
| UNSIGNED
| STRING
| IDENTIFIER
;
Presumably the issue is that it can't tell the difference between type and expression when it sees ( IDENTIFIER ) in an expression.
Output:
fail.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
fail.y:64.5-14: warning: rule useless in parser due to conflicts [-Wother]
| IDENTIFIER
^^^^^^^^^^
What can I do to fix this conflict?
If the grammar were limited to the productions shown in the OP, it would be relatively easy to remove the conflict, since the grammar is unambiguous. The only problem is that it is LR(2) and not LR(1).
The analysis in the OP is completely correct. When the parser sees, for example:
( identifier1 · )
(where · marks the current point, so the lookahead token is )), it is not possible to know whether that is a prefix of
( identifier1 · ) ;
( identifier1 · ) = ...
( identifier1 · ) identifier2
( identifier1 · ) ( ...
In the first two cases, identifier1 must be reduced to expression, so that ( expression ) can subsequently be reduced to expression; in the last two cases, identifier1 must be reduced to type, so that ( type ) expression can subsequently be reduced to expression. If the parser could see one token further into the future, the decision could be made.
Since for any LR(k) grammar there is an LR(1) grammar which recognizes the same language, a solution clearly exists; the general approach is to defer the reduction until one token of lookahead is sufficient to distinguish the cases. One way to do this is:
cast_or_expr : '(' IDENTIFIER ')'
;
cast : cast_or_expr
| '(' type ')'
;
expr_except_id : cast_or_expr
| cast expression %prec CAST
| '(' expr_except_id ')'
| expression '=' expression
;
expression : IDENTIFIER
| expr_except_id
;
(The rest of the grammar is the same, except that IDENTIFIER is removed from the productions for type.)
That works fine for grammars in which no symbol can be both a prefix and an infix operator (like -) and no operator can be elided (effectively elided, as in function calls). In particular, it won't work for C, because it leaves these ambiguities:
( t ) * a // product or cast?
( t ) ( 3 ) // function-call or cast?
Those are real ambiguities in the grammar, which can only be resolved by knowing whether t is a typename or a variable/function name.
The "usual" solution for C parsers is to resolve the ambiguity by sharing the symbol table between the scanner and the parser: since a typedef declaration must appear before the first use of the name as a typename in the applicable scope, the scanner can know, before it returns the token, whether the name has been declared with typedef. More precisely, if no typedef has been seen, the scanner can assume the name is not a type, although it may be completely undeclared.
By using a GLR parser and a semantic predicate, the logic can instead be confined to the parser. Some people find that more elegant.
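For illustration, a sketch of what the GLR variant might look like. Assumptions: a user-supplied is_typename() lookup (hypothetical, not shown), a scanner that stores each identifier's spelling as its semantic value, and Bison 3.x semantic-predicate syntax %?{ ... }, which is evaluated during the parse and kills the current GLR stack split when it yields false.

```yacc
%glr-parser
%expect-rr 1   /* the ( IDENTIFIER ) conflict remains, resolved at parse time */

%%

/* ... the rest of the grammar as in the question ... */

type
    : VOID
    /* ... the other keyword types ... */
    | IDENTIFIER %?{ is_typename($1) }  /* keep this reduction only for declared typenames */
    ;
```

When the predicate fails, that branch of the parse dies and only the expression interpretation of ( IDENTIFIER ) survives.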