Search code examples
parsingbisonglr

Bison: distinguish variables by type / YYERROR in GLR-parsers


I'm working on a parser, using GNU bison, and I'm facing an interesting problem. My actual problem is a bit different and less interesting in general, so I will state it a bit differently, so that an answer would be more useful in general.

I need to distinguish expressions depending on their type, like arithmetic expressions from string expressions. Their top-level nonterminals have some common ancestors, like

statement
    : TOK_PRINT expression TOK_SEMICOLON {...}
;
expression
    : arithmeticexpression {...}
    | stringexpression {...}
;

Now I need to be able to have variables in both kinds of expressions

arithmeticexpression
    : TOK_IDENTIFIER {...}
;
stringexpression
    : TOK_IDENTIFIER {...}
;

(only the variables of string-type should be allowed in the stringexpression case and only variables of int- or float-type should be allowed in the arithmeticexpression case) But this obviously causes an R/R conflict, which is inherent in the language - it is impossible to resolve, because the language is ambiguous.

Of course I could water down the language, so that only general "Expression" objects are passed around on the parse-stack, but then I'd have to do a lot of manual type-checking in the actions, which I would like to avoid. Also, in my real use-case, the paths through the grammar to the variable-rules are so different, that I would have to water the language down so much that I would lose a lot of grammar-rules (i.e. lose a lot of structural information) and would need to hand-write parsing-engines into some actions.

I have read about GLR-parsing, which sounds like it could solve my problem. I'm considering to use this feature: have the grammar as above and YYERROR in the paths where the corresponding variable has the wrong type.

arithmeticexpression
    : TOK_IDENTIFIER {
          if(!dynamic_cast<IntVariable*>(
              symbol_table[*$<stringvalue>1]))
              YYERROR;
      }
;
stringexpression
    : TOK_IDENTIFIER {
          if(!dynamic_cast<StringVariable*>(
              symbol_table[*$<stringvalue>1]))
              YYERROR;
      }
;

but the Bison Manual says

  1. During deterministic GLR operation, the effect of YYERROR is the same as its effect in a deterministic parser.
  2. The effect in a deferred action is similar, but the precise point of the error is undefined; instead, the parser reverts to deterministic operation, selecting an unspecified stack on which to continue with a syntax error.
  3. In a semantic predicate (see Semantic Predicates) during nondeterministic parsing, YYERROR silently prunes the parse that invoked the test.

I'm not sure I understand this correctly - I understand it this way:

  1. doesn't apply here, because GLR-parsing isn't deterministic
  2. is how I have it in the code above, but shouldn't be done this way, because it is completely random which path is killed by the YYERROR (correct me if I'm wrong)
  3. wouldn't solve my problem, because semantic predicates (not semantic actions!) must be at the beginning of a rule (correct me if I'm wrong), at which point the yylval from the TOK_IDENTIFIER token isn't accessible (correct me if I'm wrong), so I can't look into the symbol table to find the type of the variable.

Does anyone have experience with such a situation? Is my understanding of the manual wrong? How would you approach this problem? To me, the problem seems to be so natural that I would assume people run into it often enough that bison would have a solution built-in...


Solution

  • The example here works, but when I put it together with the solution for Bison: GLR-parsing of valid expression fails without error message I ran into the following problem:

    The variable could be determined by more than one identifier. You could have an array-index or it could be a member of a different object. I modeled this situation by having another non-terminal like

    lvalue
        : TOK_IDENTIFIER
        | lvalue '[' arithmeticexpression ']'
        | lvalue '.' TOK_IDENTIFIER
    

    But when I now have

    arithmeticexpression : lvalue
    stringexpression : lvalue
    

    and (try to) access the object from the non-terminal "lvalue" like above, I get a segmentation fault. So it seems that the method here doesn't work for more complex situations.


    What I did instead now is that I access the object in the semantic ACTION and set $$ to nullptr if the variable has the wrong type.

    arithmeticexpression
        : lvalue
        {
            $$ = nullptr;
            if(dynamic_cast<IntVariable*>($lvalue->getVariable()))
                $$ = new ArithmeticExpressionIntVariable($lvalue);
        }
    

    Then I had to propagate the nullptr (this is quite a lot of additional code) like this

    arithmeticexpression
        ...
        | arithmeticexpression[left] '+' arithmeticexpressionM[right]
        {
            $$ = nullptr;
            if($left && $right)
                $$ = new ArithmeticExpressionPlus($left, $right);
        }
    

    So now, at the

    expression
        : arithmeticexpression
        | stringexpression
    (note: I have "Expression* expr;" in the "%union{" declaration
    and "%type <expression> expr" in the prologue)
    

    I have an ambiguity: the same input text can be parsed in two different ways, but only one of them will have a value != nullptr. At this point, I need a custom merge procedure which basically just selects the non-null value.

    For this, I forward-declared it in the prologue of my bison file like this

    static Expression* exprMerge (yy::semantic_type x0, yy::semantic_type x1);
    

    and defined it in the epilogue like this

    static Expression* exprMerge (YYSTYPE x0, YYSTYPE x1) \
    {   
        /* otherwise we'd have a legitimate ambiguity */
        assert(!x0.expr || !x1.expr);
        return x0.expr ? x0.expr : x1.expr;
    }
    

    finally, I had to tell bison to use this procedure to resolve the ambiguity, using

    expression
        : arithmeticexpression  %merge <exprMerge>
        | stringexpression %merge <exprMerge>
    

    Honestly, I think this is quite a lot of effort, which wouldn't be necessary if bison tested semantic predicates (at least the ones which are behind the rule) when it tries to merge the paths, not to rule out paths before they merge.

    But at least this works and is far less effort than collecting the tokens and sorting them manually, which would have been the alternative.