I have a large-ish grammar in PLY (Python Lex-Yacc) for a language that has some particular challenges in parsing. The language allows the leading syntax of two kinds of calls to look the same almost until the end of the call non-terminal. This provides lots of opportunity for reduce/reduce conflicts, since the semantics of tokens along the way are different, but they can be built out of the same terminal tokens. I've extracted simple before/after versions of the grammar below, which I'll explain a bit.
Originally, expressions were a typical "stratified grammar," taking calls and literals and such to primary, then primary through unary, then binary to general expression. The problem was that Call_expr with two arguments conflicts with the version of Iter_expr that begins with two Ids before the '/'. The conflict was on the comma after the first argument in the call, since originally Expr -> ... -> Primary_expr -> Name_expr -> Id was allowed. The parser could either reduce Id to Expr to match the Call_expr, or leave it to match the Iter_expr. Looking ahead to the comma didn't help it decide. If the first argument to a call is just an identifier (like a variable), this is legitimate ambiguity. Consider input id > id ( id , id ...
My approach was to make a kind of expression that could not be just an Id. I added a production chain through all the expressions to give them '_nn' versions ('not a name'). I could then define productions for Call_expr that use any syntax in the first argument that makes it more than just a name (e.g. operators, calls, etc.) to reduce it to BinOp_expr_nn, and also allow a call production that has just an Id as the first argument. This should convince the parser to keep shifting until it can resolve either the Iter_expr or the Call_expr (or at least know which path it's on).
As you might have guessed, this screws everything up :). Tinkering with the expression chain also tinkers with Primary_expr, which I still need to allow to reduce to Id. But now that's a reduce/reduce conflict: every Primary_expr can stay there or go on to Unary_expr. I can order those to make a choice (which might work), but I expect I'll just end up chasing another and another.
So, my question is: Is there a technique someone can articulate for allowing the same tokens to represent different semantics (i.e. expr vs. id) while still being parseable by an LALR(1) generator like PLY? Short of that, are there any useful hacks that help manage the problem? Can this be disambiguated?
terminals: '+' '^' ',' '>' '(' ')' '/' ':' 'id' 'literal'
(i.e. punctuation (besides '->' and '|') and initial-lower-case words)
non-terminals: initial-Upper-case words
Original Grammar:
S'-> S
S -> Call_expr
| Iter_expr
Expr -> BinOp_expr
BinOp_expr -> Unary_expr
BinOp_expr -> BinOp_expr '+' BinOp_expr
Unary_expr -> Primary_expr
| '^' BinOp_expr
Primary_expr -> Name_expr
| Call_expr
| Iter_expr
| Literal_expr
Name_expr -> Id
Args -> Expr
| Args ',' Expr
Call_expr -> Primary_expr '>' Id '(' ')'
| Primary_expr '>' Id '(' Args ')'
Iter_expr -> Primary_expr '>' Id '(' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ':' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ',' Id ':' Id '/' Expr ')'
Literal_expr -> literal
Id -> id
My attempt to disambiguate the first argument in Call_expr:
S'-> S
S -> Call_expr
| Iter_expr
Expr -> BinOp_expr_nn
| BinOp_expr
BinOp_expr -> BinOp_expr_nn
| Unary_expr
BinOp_expr_nn -> Unary_expr_nn
| BinOp_expr '+' BinOp_expr
Unary_expr -> Primary_expr
| Unary_expr_nn
Unary_expr_nn -> Primary_expr_nn
| '^' BinOp_expr
Primary_expr -> Primary_expr_nn
| Name_expr
Primary_expr_nn -> Call_expr
| Iter_expr
| Literal_expr
Name_expr -> Id
Args -> Expr
| Args ',' Expr
Call_expr -> Primary_expr '>' Id '(' ')'
| Primary_expr '>' Id '(' Expr ')'
| Primary_expr '>' Id '(' Id ',' Args ')'
| Primary_expr '>' Id '(' BinOp_expr_nn ',' Args ')'
Iter_expr -> Primary_expr '>' Id '(' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ':' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ',' Id ':' Id '/' Expr ')'
Literal_expr -> literal
Id -> id
Despite the title of your post, your grammar is not ambiguous. It is simply not LR(1), for the reason you mention: the input A ( B , can be the start of either a Call_expr or an Iter_expr. In the first case, the B must be reduced to an Expr and then to Args; in the second case, it must not be reduced, because it needs to still be an id when the right-hand side id '(' id ',' id ':' id '/' Expr ')' is reduced. The decision cannot be made simply by looking at a single lookahead token (the ','), so there is a shift-reduce conflict.

This conflict can be resolved with at most two additional lookahead tokens, since the shift is only a valid action if the ',' is followed by precisely an id and then a ':'. So that makes the grammar LALR(3). Unfortunately, PLY does not generate LALR(3) parsers (and neither does yacc/bison), but there are some alternatives.
Since the grammar is unambiguous, it can be parsed without problems (and without any modifications) with a GLR parser. Ply doesn't produce GLR parsers either, but Bison can. That's not likely to be of much use to you, but I thought I should mention it in case you are not locked into the use of Python.
Rewriting the grammar is almost certainly the simplest solution, and it's what I would normally recommend. If you change the definition of Iter_expr to:

Iter_expr : id '(' id '/' Expr ')'
          | id '(' Args ':' id '/' Expr ')'

then it will still recognize every valid input (since both id and id ',' id are valid instances of Args). This removes the shift-reduce conflict; in effect, it lets the parser avoid making a decision until it either encounters a ')' (which would indicate a Call_expr) or a ':' (indicating an Iter_expr). (There's no problem with the first alternative for Iter_expr, since the decision to shift the '/' instead of reducing id does not require additional lookahead.)
Of course, the second production for Iter_expr will recognize a lot of things which are not valid iteration expressions: lists of more than two items, and lists which include expressions more complicated than a single id. However, those inputs are not valid programs at all, so they can simply be rejected in the action for Iter_expr. The precise code to recognize a valid iteration will depend on how you represent your AST, but it's not complicated: just check that the length of the Args list is either one or two, and that each item in the list is just an id.
One way to compensate for the lack of lookahead is to gather it in the lexer: buffer the necessary tokens and only emit lexemes once their syntactic category is known. In this case, the lexer can look for the sequence '(' id ',' id ':' and mark the first id so that it can only be used in an Iter_expr. Then the only change to the grammar is:

Iter_expr : id '(' id '/' Expr ')'
          | id '(' id ':' id '/' Expr ')'
          | id '(' iter_id ',' id ':' id '/' Expr ')'
While this will work fine in this particular case, it is not very maintainable. In particular, it relies on being able to define a simple and unambiguous pattern which can be implemented in the lexer. Since that pattern is a simplification of the grammar, it is quite possible that some future syntactic addition will create a syntax which also matches the same pattern. (This is called the lexical "hack" for a reason.)
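A minimal sketch of such a token filter, again assuming (type, value) token tuples and the hypothetical iter_id token name (a real PLY lexer would wrap its token() method instead of working on a list):

```python
def retag_iter_ids(tokens):
    """Retag the id that opens a '(' id ',' id ':' sequence as
    'iter_id', so the parser never has to guess its category."""
    buf = list(tokens)   # toy version: small inputs; real code would buffer a window
    out = []
    i = 0
    while i < len(buf):
        kinds = [t[0] for t in buf[i:i + 5]]
        if kinds == ['(', 'id', ',', 'id', ':']:
            out.append(buf[i])                       # the '('
            out.append(('iter_id', buf[i + 1][1]))   # mark the first id
            i += 2
        else:
            out.append(buf[i])
            i += 1
    return out
```

Everything outside the matched window passes through unchanged, so ordinary Call_expr argument lists are unaffected.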
As indicated, this grammar is LALR(3). But there is no such thing as an LALR(3) language. Or, more precisely: if a language has an LALR(k) grammar, then it also has an LALR(1) grammar, and that grammar can be produced mechanically from the LALR(k) grammar. Moreover, given a parse with the one-symbol-lookahead grammar, it is possible to reconstruct the parse tree for the original grammar.
Because the mechanically-produced grammars are quite large, this procedure is not very attractive, and I don't know of an implementation of the algorithm. Instead, it is most common to attempt to rewrite just a part of the grammar, as you attempted in the original question. This can be done, certainly, but the end result is not totally intuitive.
However, it's not that difficult. Here, for example, is a slightly simplified version of your grammar, with the redundant unit productions removed and a couple of fixes in operator precedence (possibly incorrect, since I don't know what semantics you're aiming for).
The productions whose names end with N do not produce ID. For each of them, there is a corresponding production without the N which adds ID. Once that was done, it was necessary to rewrite Args to use the ExprN production, as well as to allow the various argument lists which start with one or two IDs. The Chain production is just to avoid some repetition.
Start : Call
| Iter
Expr : ID
| ExprN
ExprN : UnaryN
| Expr '+' Unary
Unary : ID
| UnaryN
UnaryN : ChainN
| '^' Chain
Chain : ID
| ChainN
ChainN : PrimaryN
| Chain '>' CallIter
PrimaryN: LITERAL
| Call
| Iter
| '(' Expr ')'
Call : ID '(' ')'
| ID '(' ID ')'
| ID '(' ID ',' ID ')'
| ID '(' Args ')'
Iter : ID '(' ID '/' Expr ')'
| ID '(' ID ':' ID '/' Expr ')'
| ID '(' ID ',' ID ':' ID '/' Expr ')'
Args : ExprN ExprList
| ID ',' ExprN ExprList
| ID ',' ID ',' Expr ExprList
ExprList:
| ExprList ',' Expr
I haven't tested this as much as I would have wanted to, but I think it produces the correct language.
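One practical wrinkle with the rewritten Args is that its three shapes deliver the leading IDs separately from the expression tail, so the Call action should flatten them back into one uniform list. A hedged sketch, one hypothetical action helper per production (the tuple AST encoding is an assumption):

```python
def args_from_exprn(e, tail):
    # Args -> ExprN ExprList
    return [e] + tail

def args_from_id_exprn(i1, e, tail):
    # Args -> ID ',' ExprN ExprList
    return [('id', i1), e] + tail

def args_from_two_ids(i1, i2, e, tail):
    # Args -> ID ',' ID ',' Expr ExprList
    return [('id', i1), ('id', i2), e] + tail
```

With this, the Call production sees a single argument-list representation regardless of which Args alternative fired.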