Search code examples
parsingyaccreduce-reduce-conflictb-lang

How to resolve this reduce/reduce conflict?


I am writing a compiler for the B programming language. The grammar of this language distinguishes lvalues and rvalues syntactically. While translating the grammar into yacc syntax, I stumbled upon a reduce/reduce conflict. Here is a minimal, complete, and verifiable example:

%right '['
%left '+'

%%

rvalue  : '+' lvalue 
    | lvalue
    ;

lvalue  : 'x'
    | rvalue '[' rvalue ']'
    ;

Yacc indicates 1 reduce/reduce conflict. This reduce/reduce conflict is found in state 6:

6: reduce/reduce conflict (reduce 1, reduce 2) on '['
state 6
    rvalue : '+' lvalue .  (1)
    rvalue : lvalue .  (2)

    .  reduce 1

It seems obvious that “reduce 1” should be chosen as the resolution of this conflict as “reduce 2” seemingly cannot ever lead to a successful parse.

How can I resolve this conflict?

For reasons of portability, I am not willing to use bison or any features of yacc outside of those specified in POSIX.1 2008.


Solution

  • For the benefit of anyone reading this question and answer, it is perhaps useful to know that the + token in the question is intended to be the pre-increment operator ++. According to a comment, the change was made in order to avoid having to introduce a token declaration. Below, I've taken the liberty of changing '+' to the Bison syntax "++", because I think it's less confusing to use the normal spelling of the intended operator. I also did use Bison's quoted-token extension, because it is more readable. (But it's trivial to remove.)

    The conflict occurs because there actually is a valid parse which uses the rvalue: lvalue production. Specifically, the input

    ++ x [ x ]
    

    could be parsed by your grammar in two different ways:

          rvalue                                      rvalue
      /           \                                      |
    "++"        lvalue                                lvalue
            /--------------\                   /------------------\
         rvalue '[' rvalue ']'              rvalue   '['   rvalue   ']'
            |          |                   /      \           |
         lvalue     lvalue               "++"   lvalue     lvalue
            |          |                           |          |
           'x'        'x'                         'x'        'x'
    

    Note that the first one is the parse you want; the subscript operator binds more tightly than a prefix increment operator, so that ++x[x] is correctly parsed as ++ (x[x]). Pretty well all languages handle postfix operators that way, which matches the expected behaviour. (The vast majority of programmers would expect -x[3] to first extract element 3 of the array x and then negate it. Binding the -x first makes no sense at all. That's no less true for ++; if x is an array, ++x makes as little sense as -x.)

    That's contrary to your assertion that "reduce 1" should be chosen; the correct parse requires that "reduce 2" be taken. That error is also reflected in your precedence declaration, which logically should give precedence right-associatively to postfix operators:

    %right "++" '['
    

    (Technically, prefix operators bind less tightly than postfix operators. But it's OK for them to share a precedence level because of the right-associativity.)

    But there's no point making that change, because precedence declarations cannot resolve reduce/reduce conflicts, because the resolution by precedence always involves the comparison between the precedence of a production which could be reduced and the precedence of the lookahead token which could be shifted. (In other words, the types of the things being compared are different.)

    In State 6 (reproduced in the question), the parser has shifted the "++" and then the 'x', and then performed the forced reduction of 'x' to lvalue. So the parser stack is ... "++" lvalue and the lookahead token is [. If the grammar were not attempting to separate lvalues and rvalues (so that the top of the stack was just value instead of lvalue), then the choices available to the parser would be to reduce "++" value to value, or to shift [ in preparation for the right-hand side value '[' value ']'. With the above precedence level declaration, shifting would win because of right-associativity, so the correct parse would emerge.

    But the grammar is attempting to distinguish between lvalues and rvalues, and that makes it impossible for the parser to shift [; for [ to be valid, it must first reduce the lvalue to an rvalue. Precedence decisions, however, are always immediate; the parser doesn't really see the rvalue: lvalue reduction as somehow being the prelude to shifting [. What it sees is two competing reduce actions, and precedence doesn't apply to such conflicts.

    Since precedence declarations aren't going to help with this particular conflict, it's simplest to avoid trying to use them for unary operators, reserving their use for binary operators. (It would also be possible to not use them at all, but they are handy for expressing binary precedence.) The B reference manual [Note 1] makes it clear that the narrative text, not the included grammar, is what precisely defines operator precedence and associativity, and the narrative text includes two syntactic categories, Primary expressions and Unary expressions, which do not appear in the grammar but which are, in fact, syntactically necessary.

    It's easy to write a grammar using these non-terminals if we ignore the lvalue/rvalue distinction, so that's a good place to start. (Note: I moved the post-increment/decrement operators into primary to avoid a reliance on precedence declarations.)

    %token NAME CONSTANT
    %token INC "++" DEC "--"
    %left '+' '-'
    %left '*' '/' '%'
    %start value
    %%
    primary : NAME
            | primary '[' value ']'
            | CONSTANT
            | '(' value ')'
            | primary "++"
            | primary "--"
    unary   : primary
            | '*' unary 
            | '-' unary
            | '&' unary
            | "++" unary
            | "--" unary
    value   : unary
            | value '+' value
            | value '-' value
            | value '*' value
            | value '/' value
            | value '%' value
    

    Now we can see that there are two distinct non-terminals which need to be split into l and r variants, since both primary and unary can produce lvalues. (x[x] and *x, respectively.) However, it's not quite as simple as just dividing both of these non-terminals into two categories, because of the cascade:

    value   : unary
    unary   : primary 
    

    combined with the desired implicit reduction of lvalues to rvalues.

    Our first thought might be to just split the non-terminals, letting the cascade flow through the rvalue: lvalue productions:

    value   : runary
    runary  : lunary
            | rprimary
    lunary  : lprimary
    rprimary: lprimary
    

    Unfortunately, that produces two different paths for reaching an lprimary:

    value -> runary -> lunary   -> lprimary
    value -> runary -> rprimary -> lprimary 
    

    Since the cascade productions have no associated action, and the lvalue to rvalue conversion (a dereference operation) is the same for both instances, it actually makes no difference to us which of these paths is chosen. But the parser will care, so we have to eliminate one of them. Here's one possible solution:

    %token NAME CONSTANT
    %token INC "++" DEC "--"
    %left '+' '-'
    %left '*' '/' '%'
    %start value
    %%
    lprimary: NAME
            | primary '[' value ']'
    primary : lprimary
            | rprimary
    rprimary: CONSTANT
            | '(' value ')'
            | lprimary "++"
            | lprimary "--"
    lunary  : lprimary
            | '*' runary
    runary  : lunary
            | rprimary 
            | '-' runary
            | '&' runary
            | "++" lunary
            | "--" lunary
    value   : runary
            | value '+' value
            | value '-' value
            | value '*' value
            | value '/' value
            | value '%' value