yacc infix-notation lalr shift-reduce-conflict ocamlyacc

Shift/reduce conflict with infix sections

I'm having trouble with a yacc-like implementation (using ocamlyacc, in particular) of a grammar that includes ordinary infix operations plus infix sections, like in Haskell. I desire all of these to be grammatical:

(+1)
(1+)
(+)
(1+1)

However, I haven't been able to get this working, even by fiddling with associativity/precedence declarations. I can see in the grammar.output where the problem is happening (it's shifting where I want it to reduce), but I haven't been able to coax it to go the way I want. Here is a simplified demonstration of the problem.

lex.mll has:

{
  open Parse
  exception Eof
}
rule token = parse
  | [' ' '\t'] { token lexbuf }
  | ['\n'] { EOL }
  | ['0'-'9']+ as num {INT(int_of_string num)}
  | '+' { PLUS }
  | '*' { TIMES }
  | '(' { LPAREN }
  | ')' { RPAREN }
  | eof { raise Eof }

main.ml has:

let _ =
  try
    let lexbuf = Lexing.from_channel stdin in
    while true do
      let result = Parse.start Lex.token lexbuf in
      print_string result; print_newline(); flush stdout
    done
  with Lex.Eof -> exit 0

and parse.mly (where the trouble lies) has:

%token <int> INT
%token PLUS TIMES
%token LPAREN RPAREN
%token EOL

%left PLUS
%left TIMES

%start start
%type <string> start
%%

start:
| expr EOL {$1}
;

expr:
| application {$1}
| expr PLUS expr {"[" ^ $1 ^ "+" ^ $3 ^"]"}
| expr TIMES expr {"[" ^ $1 ^ "*" ^ $3 ^"]"}
;

section:
| LPAREN atom PLUS RPAREN { "(" ^ $2 ^ " +)" }
| LPAREN PLUS atom RPAREN { "(+ " ^ $3 ^ ")" }
| LPAREN PLUS RPAREN { "(+)" }
;

application:
| atom {$1}
| application atom {"[" ^ $1 ^ " " ^ $2 ^ "]"}
;

atom:
| INT {string_of_int $1}
| section { $1 }
| LPAREN expr RPAREN { "(" ^ $2 ^ ")" }
;

%%

Running ocamlyacc on that tells me there is 1 shift/reduce conflict. In particular here is the relevant part of the verbose log:

Rules:
   6  section : LPAREN atom PLUS RPAREN
   ...
   9  application : atom
...
12: shift/reduce conflict (shift 21, reduce 9) on PLUS
state 12
        section : LPAREN atom . PLUS RPAREN  (6)
        application : atom .  (9)

        PLUS  shift 21
        INT  reduce 9
        MINUS  reduce 9
        TIMES  reduce 9
        LPAREN  reduce 9
        RPAREN  reduce 9
...
state 21
        section : LPAREN atom PLUS . RPAREN  (6)

        RPAREN  shift 26
        .  error

And running the compiled program will correctly parse all of the following:

(1+)
(+1)
(+)
1+2

but fails with:

(1+2)

If on the other hand, I create a dummy token HIGH with high precedence:

%left PLUS MINUS
%left TIMES
%nonassoc HIGH

and then put %prec HIGH on the Rule 9:

application: atom %prec HIGH {$1}

in that case (1+2) will parse but (1+) won't.

I understand the general background of shift/reduce conflicts. I just can't figure out how to negotiate it to solve this parsing challenge.

Solution

Leaving out a lot of your grammar, you have the following productions, all of which could be simultaneously feasible.

atom:    LPAREN expr RPAREN
expr:           expr PLUS expr
section: LPAREN atom PLUS RPAREN

So let's say we've just read ( 0 -- that is, an LPAREN and an INT -- and the next token up is +. At this point, we need to reduce the INT to an atom, but we can't tell whether what follows will match the atom or the section rule. To match the atom rule, we'd need to reduce the atom to an expr -- by way of application -- but to match the section rule, we need it to stay as an atom. So we have a shift/reduce conflict; we don't know whether we need to shift the + now, or after doing some more unit reductions.

The simple solution is to delay the decision. If the section rule were:

section: LPAREN expr PLUS RPAREN

then there wouldn't be a problem. We'd continue the unit reductions until we got an expr, then we'd shift the +, and then we'd either see an ) or we'd see something which could start an expr. Conflict resolved.

Of course, that changes the language, making it more permissive. We might not want to accept:

( 3 + 4 + )

( (+) 3 4 + )

But the resulting grammar is not ambiguous. We could just let the parser continue and then issue an error message when reducing section, by checking to see whether $2 was appropriately restricted. (This is a pretty common technique, and there is nothing wrong with it.)

Alternatively, we could separate the

expr: expr PLUS expr

rule into two mutually-exclusive alternatives:

expr: atom PLUS expr
expr: expr_not_an_atom PLUS expr

This would also solve the conflict, because the atom could not be reduced to expr_not_an_atom. But it leaves open the question of how to define expr_not_an_atom.

As it happens, I'm pretty sure that is possible but it's not quite trivial, and the consequences will ripple down the grammar. I can't give you an algorithm, either, because CFGs -- unlike regular expressions -- are not closed under negation or set difference. But basically, you need to just cascade through the non-terminals, splitting them so each alternative will fit either into atom or expr_not_an_atom- This is also a legitimate approach, but the resulting grammar can be hard to read.

If you were using bison, you would have another alternative: generate a GLR grammar. As long as your language is not ambiguous, the GLR grammar will find the correct parse, possibly slightly more slowly but with a lot less effort on your part.

In case it helps, here's a slightly related answer in which I produced a fully-worked out solution to splitting non-terminals.