Tags: parsing, syntax-error, bison, operator-precedence

Combining unary operators with different precedence


I was having some trouble getting Bison to accept a pair of operators: <-, an identity postfix operator with low precedence that forces evaluation of everything on its left first, e.g. 1+2<-*3 is equivalent to (1+2)*3, and ->, a prefix operator that does the same thing to what's on its right.

I was not able to get the syntax to work properly, so I tested in Python with - not False, which results in a syntax error (in Python, - has higher precedence than not). This is not a problem in C or C++, where - and !/not have the same precedence.

Of course, the difference in precedence has nothing to do with any relationship between the two operators themselves; it comes only from their relationships with the other operators, which determine their relative precedence.

Why is chaining prefix or postfix operators with different precedences a problem when parsing, and how can I implement the <- and -> operators while still having higher-precedence operators like !, ++, NOT, etc.?

Obligatory Bison (this pattern is repeated for all operators, where copy has greater precedence than post_unary):

post_unary:
    copy
|   post_unary "++"
|   post_unary "--"
|   post_unary '!'
;

Chaining operators within this category, e.g. x ! -- !, works fine syntactically.


Solution

  • Ok, let me suggest a possible erroneous grammar based on your sketch:

    low_postfix:
        mid_infix
    |   low_postfix "<-"
    mid_infix:
        high_postfix
    |   mid_infix '+' high_postfix
    high_postfix:
        term
    |   high_postfix "++"
    term:
        ID
    |   '(' expr ')'
    

    It should be clear just looking at those productions that var <- ++ is not part of the language. The only things that can be used as an operand to ++ are terms and other applications of ++. var <- is neither of these things.

    On the other hand, var ++ <- is fine, because the operand to <- can be a mid_infix which can be a high_postfix which is an application of the ++ operator.
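
    Spelled out as a derivation under that sketch grammar, var ++ <- comes out like this:

    low_postfix
    → low_postfix "<-"          (the operand of <- is a low_postfix)
    → mid_infix "<-"
    → high_postfix "<-"
    → high_postfix "++" "<-"    (++ is applied inside that operand)
    → term "++" "<-"
    → ID "++" "<-"              (i.e. var ++ <-)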

    If the intention were to allow both of those postfix sequences, then that grammar is incorrect.

    A version of that cascade is present in the Python grammar (albeit using prefix operators), which is why not - False is OK but - not False is a syntax error. I'm reluctant to call that a bug because it may have been intentional. (Really, neither of those expressions makes much sense.) We could disagree about the value of such an intention, but not on SO, which prefers to avoid opinionated discussions.

    Note that what we might call "strict precedence" in this grammar and the Python grammar is by no means restricted to combinations of unary operators. Here's another one which you have likely never tried:

    $ python3 -c 'print(41 + not False)'
      File "<string>", line 1
        print(41 + not False)
                     ^
    SyntaxError: invalid syntax
    

    So, how can we fix that?

    On some level, it would be nice to be able to just write an unambiguous grammar which conveyed our intention. And it is certainly possible to write an unambiguous grammar, which would convey the intention to bison. But it's at least an open question as to whether it would convey anything to a human reader, because the massive clutter of multiple rules necessary in order to keep track of what is and is not an acceptable grouping would be pretty daunting.
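
    To get a feel for that clutter, here's a small illustration of my own (the token names are made up): an unambiguous grammar for just two precedence levels, a low-precedence prefix NOT and a higher-precedence left-associative '+', written so that 41 + not False is accepted and grouped the way precedence declarations would group it, as 41 + (not False):

    %token ID NUM NOT
    %%
    expr:
        sum
    |   NOT expr            /* prefix NOT grabs everything to its right...      */
    |   sum '+' NOT expr    /* ...including when it is the right operand of '+' */
    ;
    sum:
        term
    |   sum '+' term
    ;
    term:
        ID
    |   NUM
    |   '(' expr ')'
    ;

    Even with a single binary operator, the NOT expr alternative has to be repeated at every position where the low-precedence prefix may legally begin an operand; add a few more binary precedence levels and some postfix operators, and the number of split non-terminals and duplicated alternatives grows quickly. (Note 1 below comes back to why.)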

    On the other hand, it's dead simple to do with bison/yacc precedence declarations. We just list the operators in order, and the parser generator resolves all the ambiguities accordingly. [See Note 1 below]

    Here's a similar grammar to the above, with precedence declarations. (I left the actions in place in case you want to play with it, although it's by no means a Reproducible Example; the infrastructure it relies upon is much bigger than the grammar itself, and of little use to anyone other than me. So you'll have to define the three functions and fill in some of the bison type declarations. Or just delete the AST functions and use your own.)

    %left ','
    %precedence "<-"                                                        
    %precedence "->" 
    %left '+'
    %left '*'                                                               
    %precedence NEG
    %right "++" '('
    %%
    expr: expr ',' expr                { $$ = make_binop(OP_LIST, $1, $3); }
        | "<-" expr                    { $$ = make_unop(OP_LARR, $2); }
        | expr "->"                    { $$ = make_unop(OP_RARR, $1); }
        | expr '+' expr                { $$ = make_binop(OP_ADD, $1, $3); }
        | expr '*' expr                { $$ = make_binop(OP_MUL, $1, $3); }
        | '-' expr          %prec NEG  { $$ = make_unop(OP_NEG, $2); }
        | expr '(' expr ')' %prec '('  { $$ = make_binop(OP_CALL, $1, $3); }
        | "++" expr                    { $$ = make_unop(OP_PREINC, $2); }
        | expr "++"                    { $$ = make_unop(OP_POSTINC, $1); }
        | VALUE                        { $$ = make_ident($1); }
        | '(' expr ')'                 { $$ = $2; }
        ;
    

    A couple of notes:

    1. I used %prec NEG on the unary minus production in order to separate that production from the subtraction production. I also used a %prec declaration to modify the precedence of the call production (the default would be ')'), although in this particular case that's unnecessary. It is necessary to put '(' into the precedence list, though: '(' is the lookahead symbol used in the precedence comparison.

    2. For the unary operators, I mostly used bison's %precedence declaration in the precedence list, rather than %right or %left. There is really no such thing as associativity with unary operators, so I think it's more self-documenting to use %precedence, which doesn't resolve conflicts between a shift and a reduce at the same precedence level. However, even though there is no such thing as associativity between unary operators, the nature of the precedence-resolution algorithm means that you can put prefix and postfix operators in the same precedence level and choose whether the postfix or the prefix operators take priority by using %right or %left, respectively. %right is almost always correct. I did that with ++, because I was a bit lazy by the time I got to that point. (Both of these notes are illustrated in the sketch below.)
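
    The sketch: binary and unary minus separated with %prec NEG, and prefix and postfix ++ sharing one %right level so that the postfix form wins. It's my own condensed illustration; ID and the INC alias are placeholder token names:

    %token ID
    %token INC "++"      /* hypothetical alias; the lexer would return INC for "++" */
    %left '+' '-'        /* binary add and subtract                                 */
    %precedence NEG      /* unary minus: a bare precedence level, no associativity  */
    %right "++"          /* prefix and postfix ++ share this level; %right means
                            that ++a++ parses as ++(a++), i.e. the postfix form wins */
    %%
    expr: expr '+' expr
        | expr '-' expr               /* binary minus: precedence of '-'        */
        | '-' expr         %prec NEG  /* unary minus: borrows NEG's precedence  */
        | "++" expr                   /* prefix increment                       */
        | expr "++"                   /* postfix increment                      */
        | ID
        ;

    Swap that %right for %left and ++a++ would instead come out as (++a)++.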

    This does "work" (I think). It certainly resolves all the conflicts; bison happily produces a parser without warnings. And the tests that I tried worked at least as I expected them to:

    ? a++->
    => [-> [++/post a]]
    ? a->++
    => [++/post [-> a]]
    ? 3*f(a)+2
    => [+ [* 3 [CALL f a]] 2]
    ? 3*f(a)->+2
    => [+ [-> [* 3 [CALL f a]]] 2]
    ? 2+<-f(a)*3
    => [+ 2 [<- [* [CALL f a] 3]]]
    ? 2+<-f(a)*3->
    => [+ 2 [<- [-> [* [CALL f a] 3]]]]
    

    But there are some expressions where the operator precedence, while "correct", might not be easily explained to a novice user. For example, although the arrow operators look somewhat like parentheses, they don't group that way. Furthermore, the decision as to which of the two operators has higher precedence seems to me to be totally arbitrary (and indeed I might have done it differently from what you expected). Consider:

    ? <-2*f(a)->+3
    => [<- [+ [-> [* 2 [CALL f a]]] 3]]
    ? <-2+f(a)->*3
    => [<- [* [-> [+ 2 [CALL f a]]] 3]]
    ? 2+<-f(a)->*3
    => [+ 2 [<- [* [-> [CALL f a]] 3]]]
    

    There's also something a bit odd about how the arrow operators override normal operator precedence, so that you can't just drop them into a formula without changing its meaning:

    ? 2+f(a)*3
    => [+ 2 [* [CALL f a] 3]]
    ? 2+f(a)->*3
    => [* [-> [+ 2 [CALL f a]]] 3]
    

    If that's your intention, fine. It's your language.

    Note that there are operator precedence problems which are not quite so easy to solve by just listing operators in precedence order. Sometimes it would be convenient for a binary operator to have different binding power on the left- and right-hand sides.

    A classic (but perhaps controversial) case is the assignment operator, if it is an operator. Assignment must associate to the right (because parsing a = b = 0 as (a = b) = 0 would be ridiculous), and the usual expectation is that it greedily accepts as much to the right as possible. If assignment had consistent precedence, then it would also accept as much to the left as possible, which seems a bit strange, at least to me. If a = 2 + b = 7 is meaningful, my intuitions say that its meaning should be a = (2 + (b = 7)) [Note 2]. That would require differential precedence, which is a bit complicated but not unheard of. C solves this problem by restricting the left-hand side of the assignment operators to (syntactic) lvalues, which cannot be binary operator expressions. But in C++, it really does mean a = ((2 + b) = 7), which is semantically valid if 2 + b has been overloaded by a function which returns a reference.
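
    For concreteness, here's a heavily condensed sketch of my own, not the real C grammar, with '+' and unary '-' standing in for everything else. The right-hand operand of '=' is a full assignment_expr, which makes assignment right-associative and greedy to the right, while the left-hand operand is restricted to a unary_expr:

    %token ID NUM
    %%
    assignment_expr:
        additive_expr
    |   unary_expr '=' assignment_expr   /* right side is a whole assignment_expr */
    ;
    additive_expr:
        unary_expr
    |   additive_expr '+' unary_expr
    ;
    unary_expr:                          /* the only shapes allowed left of '='   */
        ID
    |   NUM
    |   '-' unary_expr
    |   '(' assignment_expr ')'
    ;

    Under this sketch, a = b = 0 can only parse as a = (b = 0), and a = 2 + b = 7 is rejected outright, because 2 + b can never appear to the left of '='.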


    Notes

    1. Precedence declarations do not really add any power to the parser generator. The languages it can produce a parser for are exactly the same languages; it produces the same sort of parsing machine (a pushdown automaton); and it is at least theoretically possible to take that pushdown automaton and reverse engineer a grammar out of it. (In practice, the grammars produced by this process are usually monstrous. But they exist.)

      All that the precedence declarations do is resolve parsing conflicts (typically in an ambiguous grammar) according to some user-supplied rules. So it's worth asking why it's so much simpler with precedence declarations than by writing an unambiguous grammar.

      The simple hand-waving answer is that precedence rules only apply when there is a conflict. If the parser is in a state where only one action is possible, that's the action which remains, regardless of what the precedence rules might say. In a simple expression grammar, an infix operator followed by a prefix operator is not at all ambiguous: the prefix operator must be shifted, because there is no reduce action for a partial sequence ending with an infix operator.

      But when we're writing a grammar, we have to specify explicitly what constructs are possible at each point in the grammar, which we usually do by defining a bunch of non-terminals, each corresponding to some parsing state. An unambiguous grammar for expressions has already split the expression non-terminal into a cascading series of non-terminals, one for each operator precedence level. But unary operators do not have the same binding power on both sides (since, as noted above, one side of a unary operator does not take an operand). That means that a binary operator could well be able to accept a unary operator for one of its operands, but not be able to accept the same unary operator for its other operand. Which in turn means that we need to split all of our non-terminals again, according to whether the non-terminal appears on the left or the right side of a binary operator.

      That's a lot of work, and it's really easy to make a mistake. If you're lucky, the mistake will result in a parsing conflict; but equally it could result in the grammar not being able to recognise a particular construct which you would never think of trying, but which some irate language user feels is an absolute necessity. (Like 41 + not False)

    2. It's possible that my intuitions have been permanently marked by learning APL at a very early age. In APL, all operators associate to the right, basically without any precedence differences.