Search code examples
grammaryaccreduce-reduce-conflict

Reduce/reduce conflict


I am writing a grammar (YACC - "LALR") that should recognize the following words, for example:

ident(ident,...,ident) = ident(num,ident,num,...,num) 
ident(ident,...,ident) = num
ident = num
ident = ident(num,ident,num,...,num) 
ident(ident,num,...,num)
num
ident

And I wrote the following grammar:

(1) L : ident ( PARAM ) = E
(2) L : E

(3) E : ident ( ARG )
(4) E : N

(5) N : num
(6) N : ident

(7) ARG : ARG , E
(8) ARG : E

(9) PARAM : PARAM , ident
(10)PARAM : ident

So, my problem is a reduce/reduce conflict between rules (6) and (10). Has anyone an idea of how to resolve this ?


Previously, i asked about the following problem, thinking it was a simplification of my real problem ... but it was not.

For the record, my previous question was on the grammar to recognize those words :

a
b
b,b
b,b,b
[...]
b,b,b,[...],b # infinite list of b

So, I wrote the following grammar:

L : E
  | F

E : a
  | b

F : F , b
  | b

But, logically, I have a reduce/reduce conflict for rules:

F : b

and

E : b

The correct grammar of this problem is :

E : a
E : F
F : F,b
F : b

Solution

  • The easy solution is to check the validity of the parenthesized list in a semantic action attached to production 1. Then you just get rid of PARAM, and use ARG instead.

    The other easy solution is to ask bison to generate a GLR parser instead of an LALR parser. Since the grammar is unambiguous, this will work fine. It will be slightly slower, but in most cases you won't notice.

    However, it is possible to modify the grammar so that it will recognize precisely the desired language. There is one problem: we can't tell whether ident(ident) is an E or the start of an L until we see the following token. Furthermore, we can't tell (in most cases) whether an ident in a parenthesized list is part of an L, or whether it should be reduced to an N because it is part of an E.

    In order to avoid conflicts, we build the AST without certain reductions, and then fix it when necessary. In particular, we may need to convert an L to an E or a PARAM to an ARG, which involves converting the list of idents into a list of args, which in turn involves performing the reduction of ident to N on each ident.

    (In the following, I refer to the actual code I wrote, which uses the normal convention that terminals are ALL-CAPS, while non-terminals are lower case. Habits die hard. Sorry.)

    So what we do is to divide the productions for comma-separated lists into two disjoint sets. One is name_list, which is a simple list of identifiers (IDs), and which might turn out to be a list of parameters or a list of arguments. The other production is arg_list, which includes at least one number (NUM). That is unambiguously an argument list.

    If we are actually parsing an argument list, we might start out parsing it as an identifier list, but we will eventually find something which forces us to recognize it for what it is. Either we'll hit a NUM, in which case we need to retroactively reduce all the IDs to values, or we will get to the end of the statement without having seen an =, in which case we need to repurpose the lvalue as a call expression.

    So that results in the following. In order to be as clear as possible, I include the semantic actions. The actual functions are not included, but I trust that their behaviour is more or less obvious from their names. Note that there is an explicit conversion in two actions: one to convert a param_list to an arg_list (when a NUM is encountered), and the other one to convert an lvalue to an expr. Also, I didn't actually insert the value: ID | NUM production, because in this case it was more straight-forward to just do the reduction as part of the semantic action. See the note following the grammar.

    prog: /* EMPTY */
        | prog stmt ';'         { print_stmt($2); free_node($2); }
    
    param_list
        : ID                    { $$ = push_name(0, $1); }
        | param_list ',' ID     { $$ = push_name($1, $3); }
    arg_list
        : NUM                   { $$ = push_val(0, make_ival($1)); }
        | arg_list ',' ID       { $$ = push_val($1, make_sval($3)); }
        | arg_list ',' NUM      { $$ = push_val($1, make_ival($3)); }
        | param_list ',' NUM    { $$ = push_val(names_to_vals($1),
                                                make_ival($3)); }
    lvalue
        : ID '(' param_list ')' { $$ = make_proto($1, reverse_names($3)); }
    expr: ID '(' arg_list ')'   { $$ = make_call($1, reverse_vals($3)); }
        | lvalue                { $$ = proto_to_call($1); }
        | NUM                   { $$ = make_ival($1); }
        | ID                    { $$ = make_sval($1); }
    stmt: lvalue '=' expr       { $$ = make_assign($1, $3); }
        | expr
    

    And here is a sample output from the above:

    id1;
    [E: id1]
    3;
    [E: 3]
    f(id1);
    [CALL: f([E: id1])]
    f(3);
    [CALL: f([E: 3])]
    f(id1,3);
    [CALL: f([E: id1], [E: 3])]
    f(3,id1);
    [CALL: f([E: 3], [E: id1])]
    f(id1)=g(id2);
    [ASSIGN: [PROTO: f(id1)] = [CALL: g([E: id2])]]
    f(id1)=42;
    [ASSIGN: [PROTO: f(id1)] = [E: 42]]
    f(id1)=g(42);
    [ASSIGN: [PROTO: f(id1)] = [CALL: g([E: 42])]]
    f(id1)=g;
    [ASSIGN: [PROTO: f(id1)] = [E: g]]
    

    In a real grammar, it is likely that arg_list will actually be a list of expr, not just ID or NUM. That can still work with the above model. We need to define expr_not_ID (that is, an expr which is not just an ID), which we would use in place of NUM in the above productions. Then we can define expr as expr_not_ID | ID, for use in the two productions for stmt (and, probably, in other places in the grammar).