Faulty bison reduction using %glr-parser and %merge rules

Currently I'm trying to build a parser for VHDL which has some of the problems C++-Parsers have to face. The context-free grammar of VHDL produces a parse forest rather than a single parse tree because of it's ambiguity regarding function calls and array subscriptions

foo := fun(3) + arr(5);

This assignment can only be parsed unambiguous if the parser would carry around a hirachically, type-aware symbol table which it'd use to resolve the ambiguities somewhat on-the-fly.

I don't want to do this, because for statements like the aforementioned, the parse forest would not grow exponentially, but rather linear depending on the amount of function calls and array subscriptions.

(Except, of course, one would torture the parser with statements like)

foo := fun(fun(fun(fun(fun(4)))));

Since bison forces the user to just create one single parse-tree, I used %merge attributes to collect all subtrees recursively and added those subtrees under so called AMBIG nodes in the singleton AST.

The result looks like this.

In order to produce the above, I parsed the token stream "I=I(N);". The substance of the grammar I used inside the parse.y file, is collected below. It tries to resemble the ambiguous parts of VHDL:

start: prog
;

/* I cut out every semantic action to make this
   text more readable */
prog: assignment ';'
| prog assignment ';'
;

assignment: 'I' '=' expression
;

expression: function_call   %merge <stmtmerge2>
| array_indexing            %merge <stmtmerge2>
| 'N'
;

function_call: 'I' '(' expression ')'
| 'I'
;

array_indexing: function_call '(' expression ')'   %merge <stmtmerge>
| 'I' '(' expression ')'                           %merge <stmtmerge>
;

The whole sourcecode can be read at this github repository.

And now, let's get down to the actual Problem. As you can see in the generated parse tree above, the nodes FCALL1 and ARRIDX1 refer to the same single node EXPR1 which in turn refers to N1 twice.

This, by all means, should not have happened and I don't know why. Instead there should be the paths

FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1

Do you have any idea why bison reuses the aforementioned nodes?

I also wrote a bugreport on the official gnu mailing list for bison, without a reply to this point though. Unfortunately, due to the restictions for new stackoverflow users, I can't provide no link to this bug report...

Solution

That behaviour is expected.

expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.

For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really an directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.