Search code examples
parsinggrammaryacc

Grammar conflict with same prefix


Here's my grammar to the for statements:

FOR x>0 {
    //somthing
}

// or

FOR x = 0; x > 0; x++ {
   //somthing
}

it has the same prefix FOR, and I'd want to print the for_begin label after InitExpression, however the codes right after FOR will become useless because of confliction.

    ForStmt
            : FOR   {
                                printf("for_begin_%d:\n", n);
                       } Expression {
                                                           printf("ifeq for_exit_%d\n", n);
                                                 } ForBlock
           | FOR ForClause ForBlock
    ;

ForClause
            : InitExpression ';'   {
                                                            printf("for_begin_%d:\n", n);
                                                    }  Expression  ';'  Expression  { printf("ifeq for_exit_%d\n", n); }
    ;

I had tried to change it to something like:

ForStart
      : FOR
      | FOR InitExpression
;

or use a flag to mention where to print the for_begin label, but also fail to resolve the conflict.

How to make it not conflict?


Solution

  • How can the parser know which alternative of the FOR statement it sees?

    While it's possible that an InitExpression has identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful language lvalues can be much more complicated than just simple identifiers, and so it's likely that the InitExpression cannot be definitively identified with finite lookahead.

    But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.

    Normally, that would not cause a problem. Were it not for the MidRule Action which inserts a label, the parser does not have to make a reduction decision until it reaches the end of the first expression, at which point it needs to decide whether to reduce the first expression as an InitExpression or an Expression. But at that point, the lookahead token as either a semicolon or the first token of a block, so the lookahead token can guide the decision.

    But the Mid-Rule Action makes that impossible. The Mid-Rule Action must either be reduced or not before shifting the token which immediately follows the FOR token, and -- as your examples show -- the lookahead token could be the same (i) in both cases.

    Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traverses over the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on Mid-Rule Actions, and Mid-Rule Actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the bison manual dedicated to it, which is well worth reading.

    So there is no good solution. But in this case, there is a simple solution, because the action you want to take is just to insert a label, and inserting a label which happens never to be used is not in any way going to affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR statement, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was such a thing. You don't need to actually know which label to use until you reach the end of the conditional expression, which is much later.

    As explained in the Bison manual chapter I already linked to, this cannot be done using Mid-Rule Actions, because Bison doesn't attempt to compare Mid-Rule Actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using an MRA, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.

    That would make the grammar look something like this:

    ForLabel
          : %empty             { $$ = n; printf("for_begin_%d:\n", n++); }             
    ForStmt
          : FOR
              ForLabel[label]
              Expression       { printf("ifeq for_exit_%d\n", label); }
              ForBlock         { printf("jmp for_begin_%d\n", label);
                                 printf("for_exit_%d:\n", label); }
          | FOR
              ForLabel
              InitExpress ';'
              ForLabel[label]
              Expression ';'
              Expression       { printf("ifeq for_exit_%d\n", label); }
              ForBlock         { printf("jmp for_begin_%d\n", label);
                                 printf("for_exit_%d:\n", label); }
    ;
    

    ([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)