I'm designing my own toy language that compiles into Brainf*ck code and encountered problem with the operator precedence of the index operator. The parser is generated by Bisonc++, which uses mostly the same syntax for its grammar files as its cousins bison and bison++.
Because of the quirks in addressing memory in BF, I have to handle operations on array-elements differently, which is why I am generating specific rules for them (I left out the bodies of the rules on purpose). The relevant grammar (which is not working), is as follows:
%right '='
// a bunch of other operators
%left '+' '-'
// more operators
%left indexing
expression:
variable
|
array_element
|
expression '+' expression
|
// a bunch more rules
;
array_subscript:
'[' expression ']'
;
array_element:
expression array_subscript %prec indexing
;
Now, I was under the impression that by assigning the indexing
precedence to the array_element
rule, the parser will match the smallest expression directly to the left of it. However, consider the following code:
10 + arr[0]
// is parsed as
(10 + arr)[0]
I can't figure out why this is happening. When the code is changed to 10 + (arr[0])
, the parser does the expected thing, which I think proves this has to do with operator precedence. One of the many things I tried was using an unused symbol as the indexing operator. For example:
// bottom-most operator in the precedence definition (replacing 'indexing'):
%left '@'
array_element:
expression '@' array_subscript
;
// Code:
10 + arr@[0] // is now parsed properly
This resolves the issue for some reason. Am I doing something fundamentally wrong? Am I misunderstanding the way operator precedence works in bison et al?
Thanks in advance!
EDIT
I've pushed my dev branch to github, which contains a folder with a minimal working example including a Makefile and test-file. To rebuild the parser with different grammar-specs (file: grammar
), you can run make regenerate
, but this requires bisonc++
and flexc++
(available from Debian repo's). To just build the project with the current (faulty) parser, run make
.
https://github.com/jorenheit/brainfix/tree/better_indexing/arraybug
The precedence declaration
array_element:
expression array_subscript %prec indexing
and the order of precedence of the pseudo-token indexing
have absolutely no effect on the disambiguation of the indexing expression
10 + arr[0]
The ambiguity being resolved by precedence in this case, as you say, is between
(10 + arr)[0] and 10 + (arr[0])
and it is resolved in the moment when the parser has seen 10 + arr
and needs to decide whether or not it is (10 + arr)
. In other words, its options are:
expression: expression '+' expression
, or[
.
If it chooses to shift, then expression: expression '+' expression
will be reduced later, with a longer second operand, because all reductions are performed precisely at the end of the production being reduced, before the next token is shifted.So the precedence ordering which is being considered is between the production expression '+' expression
and the lookahead token [
. Precedence comparisons are always of that form: the precedence of a possible reduction is compared with the precedence of the lookahead token which might be shifted.
Precedence is, somewhat confusingly, always described as a comparison between two tokens, probably because that is the actual algorithm used in operator precedence parsing, which is a different parsing algorithm. In shift-reduce parsing, the precedence comparison is always between a shift and a reduction. To maintain the illusion that it is between two tokens, Bison, like almost all Yacc derivatives, uses the convention that the precedence of a reduction is based on the precedence ordering of the last visible terminal in the right-hand side. (BisonC++ is an exception.)
Usually there is usually only one visible terminal in a production which requires precedence resolution, so the fact that Yacc/Bison selects the last one isn't noticeable, but there are occasional exceptions, such as the so-called "ternary operator" in C. The precedence of the production expression: expression '?' expression ':' expression
is based on the :
token; the relevant lookahead token would be ?
, so the resolution of a?s1:b?s2:s3
is based on the comparison of ?
with :
; both of those tokens need to be in the same precedence level. To avoid this sort of confusion, and to allow precedence to be used with productions with no visible tokens, such as expression: expression array_subscript
, Yacc derivatives typically allow you to override the default by specifying an alternative token for a production, using the %prec
declaration.
This is particularly important if BisonC++ is part of your environment, because unlike almost every other Yacc derivative, BisonC++ uses the first non-terminal on the right-hand side. On the whole, it is probably a good idea to always use a %prec
declaration in rules with more than one terminal symbol.
Regardless, it's important to remember that all %prec
does is identify the precedence of a production with a terminal symbol. It has no effect on the precedence of any lookahead token, or any other production. If two rules have the same precedence, you can (and probably should) use the same %prec
declaration.
In short, the precedence of expression array_subscript
is not relevant in the resolution of 10 + arr[0]
, because reducing that production is not an option at this point in the parse. (Actually, it won't ever be relevant because array_subscript
is effectively a postfix operator, and therefore not ambiguous. Precedence resolution only occurs when there is an ambiguity in the grammar.)
I don't know why BisonC++ is choosing to reduce the addition. Originally, I thought that it was because you put [
into some precedence level which comes before +
, but apparently that is not the case. With Bison or Yacc, if [
were not in any precedence level, then it would have no declared precedence, and the precedence comparison would not be used to resolve ambiguity. If that were the case, (1) a shift-reduce conflict warning would be raised and (2) the parser would choose the shift action. And that's what happens with this grammar when tested with Bison. But BisonC++ apparently uses a default precedence, in some cases..
According to the "verbose" output generated by BisonC++, the conflicts were resolved using precedence. I haven't found any documentation for the precedence resolution algorithm used by Bisonc++, nor any note that it would be different from the algorithm used by Bison/Yacc (and, for what it's worth, required by Posix). But that doesn't mean that the documentation doesn't exist, and I'm reluctant to call this a bug without verifying what the original intent was. So I'll limit myself to saying that it is unusual.
In any case, you can force bisonc++ to resolve the conflict in the way you wish by adding a precedence level for [
(which, as noted above, is the lookahead token you're interested in). Put it after the precedence of +
(and all other infix and prefix operators, which I presume exist in the full grammar):
%left '+'
%left '['
(Tested, after a considerable fight with source dependencies, using BisonC++ 6.04.04.)