I'm having trouble with a yacc-like implementation (using ocamlyacc, in particular) of a grammar that includes ordinary infix operations plus infix sections, like in Haskell. I desire all of these to be grammatical:
(+1)
(1+)
(+)
(1+1)
However, I haven't been able to get this working, even by fiddling with associativity/precedence declarations. I can see in the grammar.output where the problem is happening (it's shifting where I want it to reduce), but I haven't been able to coax it to go the way I want. Here is a simplified demonstration of the problem.
lex.mll has:
{
open Parse
exception Eof
}
rule token = parse
| [' ' '\t'] { token lexbuf }
| ['\n'] { EOL }
| ['0'-'9']+ as num {INT(int_of_string num)}
| '+' { PLUS }
| '*' { TIMES }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
main.ml has:
let _ =
try
let lexbuf = Lexing.from_channel stdin in
while true do
let result = Parse.start Lex.token lexbuf in
print_string result; print_newline(); flush stdout
done
with Lex.Eof -> exit 0
and parse.mly (where the trouble lies) has:
%token <int> INT
%token PLUS TIMES
%token LPAREN RPAREN
%token EOL
%left PLUS
%left TIMES
%start start
%type <string> start
%%
start:
| expr EOL {$1}
;
expr:
| application {$1}
| expr PLUS expr {"[" ^ $1 ^ "+" ^ $3 ^"]"}
| expr TIMES expr {"[" ^ $1 ^ "*" ^ $3 ^"]"}
;
section:
| LPAREN atom PLUS RPAREN { "(" ^ $2 ^ " +)" }
| LPAREN PLUS atom RPAREN { "(+ " ^ $3 ^ ")" }
| LPAREN PLUS RPAREN { "(+)" }
;
application:
| atom {$1}
| application atom {"[" ^ $1 ^ " " ^ $2 ^ "]"}
;
atom:
| INT {string_of_int $1}
| section { $1 }
| LPAREN expr RPAREN { "(" ^ $2 ^ ")" }
;
%%
Running ocamlyacc
on that tells me there is 1 shift/reduce conflict
. In particular here is the relevant part of the verbose log:
Rules:
6 section : LPAREN atom PLUS RPAREN
...
9 application : atom
...
12: shift/reduce conflict (shift 21, reduce 9) on PLUS
state 12
section : LPAREN atom . PLUS RPAREN (6)
application : atom . (9)
PLUS shift 21
INT reduce 9
MINUS reduce 9
TIMES reduce 9
LPAREN reduce 9
RPAREN reduce 9
...
state 21
section : LPAREN atom PLUS . RPAREN (6)
RPAREN shift 26
. error
And running the compiled program will correctly parse all of the following:
(1+)
(+1)
(+)
1+2
but fails with:
(1+2)
If on the other hand, I create a dummy token HIGH
with high precedence:
%left PLUS MINUS
%left TIMES
%nonassoc HIGH
and then put %prec HIGH
on the Rule 9:
application: atom %prec HIGH {$1}
in that case (1+2)
will parse but (1+)
won't.
I understand the general background of shift/reduce conflicts. I just can't figure out how to negotiate it to solve this parsing challenge.
Leaving out a lot of your grammar, you have the following productions, all of which could be simultaneously feasible.
atom: LPAREN expr RPAREN
expr: expr PLUS expr
section: LPAREN atom PLUS RPAREN
So let's say we've just read ( 0 -- that is, an LPAREN
and an INT
-- and the next token up is +. At this point, we need to reduce the INT
to an atom
, but we can't tell whether what follows will match the atom
or the section
rule. To match the atom
rule, we'd need to reduce the atom
to an expr
-- by way of application
-- but to match the section
rule, we need it to stay as an atom
. So we have a shift/reduce conflict; we don't know whether we need to shift the + now, or after doing some more unit reductions.
The simple solution is to delay the decision. If the section
rule were:
section: LPAREN expr PLUS RPAREN
then there wouldn't be a problem. We'd continue the unit reductions until we got an expr
, then we'd shift the +, and then we'd either see an ) or we'd see something which could start an expr
. Conflict resolved.
Of course, that changes the language, making it more permissive. We might not want to accept:
( 3 + 4 + )
or
( (+) 3 4 + )
But the resulting grammar is not ambiguous. We could just let the parser continue and then issue an error message when reducing section
, by checking to see whether $2
was appropriately restricted. (This is a pretty common technique, and there is nothing wrong with it.)
Alternatively, we could separate the
expr: expr PLUS expr
rule into two mutually-exclusive alternatives:
expr: atom PLUS expr
expr: expr_not_an_atom PLUS expr
This would also solve the conflict, because the atom
could not be reduced to expr_not_an_atom
. But it leaves open the question of how to define expr_not_an_atom
.
As it happens, I'm pretty sure that is possible but it's not quite trivial, and the consequences will ripple down the grammar. I can't give you an algorithm, either, because CFGs -- unlike regular expressions -- are not closed under negation or set difference. But basically, you need to just cascade through the non-terminals, splitting them so each alternative will fit either into atom
or expr_not_an_atom
- This is also a legitimate approach, but the resulting grammar can be hard to read.
If you were using bison
, you would have another alternative: generate a GLR grammar. As long as your language is not ambiguous, the GLR grammar will find the correct parse, possibly slightly more slowly but with a lot less effort on your part.
In case it helps, here's a slightly related answer in which I produced a fully-worked out solution to splitting non-terminals.