Search code examples
parsingyaccshift-reduce-conflictlemonshift-reduce

How to overcome shift-reduce conflict in LALR grammar


I am trying to parse positive and negative decimals.

number(N) ::= pnumber(N1).

number(N) ::= nnumber(N1).

number(N) ::= pnumber(N1) DOT pnumber(N2).

number(N) ::= nnumber(N1) DOT pnumber(N2).

pnumber(N) ::= NUMBER(N1).

nnumber(N) ::= MINUS NUMBER(N1).

The inclusion of the first two rules gives a shift/reduce conflict but I don't know how I can write the grammar such that the conflict never occurs. I am using the Lemon parser.

Edit: conflicts from .out file

State 79:
     (56) number ::= nnumber *
          number ::= nnumber * DOT pnumber

                           DOT shift        39
                           DOT reduce       56      ** Parsing conflict **
                     {default} reduce       56     number ::= nnumber

State 80:
     (55) number ::= pnumber *
          number ::= pnumber * DOT pnumber

                           DOT shift        40
                           DOT reduce       55      ** Parsing conflict **
                     {default} reduce       55     number ::= pnumber
State 39:
          number ::= nnumber DOT * pnumber
          pnumber ::= * NUMBER

                        NUMBER shift-reduce 59     pnumber ::= NUMBER
                       pnumber shift-reduce 58     number ::= nnumber DOT pnumber

State 40:
          number ::= pnumber DOT * pnumber
          pnumber ::= * NUMBER

                        NUMBER shift-reduce 59     pnumber ::= NUMBER
                       pnumber shift-reduce 57     number ::= pnumber DOT pnumber

Edit 2: Minimal grammar that causes issue

start ::= prog.
prog ::= rule.
rule ::= REVERSE_IMPLICATION body DOT.
body ::= bodydef.
body ::= body CONJUNCTION bodydef.
bodydef ::= literal.
literal ::= variable.
variable ::= number.
number ::= pnumber.
number ::= nnumber.
number ::= pnumber DOT pnumber.
number ::= nnumber DOT pnumber.
pnumber ::= NUMBER.
nnumber ::= MINUS NUMBER.

Solution

  • The conflicts you show indicate a problem with how the number non-terminal is used, not with number itself.

    The basic problem is that after seeing a pnumber or nnumber, when the next token of lookahead is a DOT, it can't decide if that should be the end of the number (reduce, so DOT is part of some other non-terminal after the number), or if the DOT should be treated as part of the number (shifted so it can later reduce one of the p/nnumber DOT pnumber rules.)

    So in order to diagnose the problem, you'll need to show all the rules that use number anywhere on the right hand side (and recursively any other rules that use any of those rules' non-terminals on the right).

    Note that it is rarely useful to post just a fragment of a grammar, as the LR parser construction process depends heavily on the context of where the rules are used elsewhere in the grammar...


    So the problem here is that you need two-token lookahead to differentiate between a DOT in a (real) number literal and a DOT at the end of a rule.

    The easy fix is to let the lexer deal with it -- lexers can do small amounts of lookahead quite easily, so you can recognize REAL_NUMBER as a distinct non-terminal from NUMBER (probably still without the -, so you'd end up with

    number ::= NUMBER | MINUS NUMBER | REAL_NUMBER | MINUS REAL_NUMBER
    

    It's much harder to remove the conflict by factoring the grammar but it can be done.


    In general, to refactor a grammar to remove a lookahead conflict, you need to figure out the rules that manifest the conflict (rule and number here) and refactor things to bring them together into rules that have common prefixes until you get far enough along to disambiguate.

    First, I'm going to assume there are other rules besides number that can appear here, as otherwise we could just eliminate all the intervening rules.

    variable ::= number | name
    

    We want to move the number rule "up" in the grammar to get it into the same place as rule with DOT. So we need to split the containing rules to special case when they end with a number. We add a suffix to denote the rules that correspond to the original rule with all versions that end in a number split off

    variable ::= number | variable_n
    variable_n ::= name
    

    ...and propagate that "up"

    literal ::= number | literal_n
    literal_n ::= variable_n
    

    ...and again

    bodydef ::= number | bodydef_n
    bodydef_n := literal_n
    

    ...and again

    body ::= number | body_n
    body := body CONJUNCTION number
    body_n ::= bodydef_n
    body_n ::= body CONJUNCTION bodydef_n
    

    Notice that as you move it up, you need to split up more and more rules, so this process can blow up the grammar quite a bit. However, rules that are used only at the end of a rhs that you're refactoring will end up only needing the _n version, so you don't necessarily have to double the number of rules.

    ...last step

    rule ::= REVERSE_IMPLICATION body_n DOT
    rule ::= REVERSE_IMPLICATION number DOT
    rule ::= REVERSE_IMPLICATION body CONJUNCTION number DOT
    

    Now you have the DOTs in all the same places, so expand the number rules:

    rule ::= REVERSE_IMPLICATION body_n DOT
    rule ::= REVERSE_IMPLICATION integer DOT
    rule ::= REVERSE_IMPLICATION integer DOT pnumber DOT
    rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT
    rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT pnumber DOT
    

    and the shift-reduce conflicts are gone, because the rules have common prefixes up until past the needed lookahead to determine which to use. I've reduced the number of rules in this final expansion by adding

    integer ::= pnumber | nnumber