Search code examples
parsinggrammarbisonlr-grammarshift-reduce-conflict

How can I eliminate shift-reduce conflicts by the same operator


I intended to use bison to parse some scripting language, in this language I can write code like the following:

a = input()
b = a + 1
function myfunc
    a = input()
    b = a + 1
end function

I found that the block

a = input()
b = a + 1

which appear both in and out of the function definition can be reduced by the same rule stmts, so I write code like the following

%{
#include <stdio.h>
#include <string>
#include <sstream>
#include <iostream>
#include <stdarg.h>
#include <tuple>

using namespace std;
%}

%debug

%token CRLF EXP FUNCTIONBEGIN FUNCTIONEND

%start program

%%

stmt : EXP
    |
stmts : stmt CRLF stmts
    | stmt
function : FUNCTIONBEGIN CRLF stmts CRLF FUNCTIONEND
unit : function
    | stmts
program : unit
    | unit CRLF program

%%

Of course this code can't work due to one shift/reduce conflict

State 3

    3 stmts: stmt . CRLF stmts
    4      | stmt .

    CRLF  shift, and go to state 9

    CRLF      [reduce using rule 4 (stmts)]
    $default  reduce using rule 4 (stmts)

I thought this conflict is due to both my program rule and stmts rule using the same terminal CRLF as a "binary operator", so I can't solve this conflict by set priority to operators.

Maybe I can merge program rule and stmts rule together by somehow adding another two rules to stmt

stmts : function CRLF stmts
    | function

However I thought this method(whether it can practically work) is not very beautiful, so I ask if there's some other solutions


Solution

  • The problem has nothing to do with CRLF tokens. Rather, it is your definition of program. A program is a series of units where each unit is a function or a stmts. But stmts is not a "unit", which is hinted at by the fact that its name is plural. A stmts is a series of stmts.

    So suppose we have a program consisting of three statements. How many units is that? Is it one stmts consisting of all three statements? Or two of them, one with two statements and the other with just one? Or the other way around? Or even three units, each consisting of a stmts containing a single statement?

    The parser can't tell which of those alternatives is desired because the grammar is ambiguous. And that is what creates the conflict.

    The simplest solution is to change the production unit: stmts to be singular: unit: stmt. Then there is no ambiguity; the three-statement program has exactly three units, each a single stmt.

    By the way, you should always prefer left recursion when writing LR grammars. Right recursion doesn't usually create conflicts, and it has nothing to do with your current problem, but it does lead to unnecessarily Iarge parsing stacks, and the reduction of lists like units and stmts will execute from right-to-left as the components are popped off the stack, which is often not what is intended.