Search code examples
cparsinglemon

How to handle same symbol used for two things lemon parser


I'm developing a domain specific language. Part of the language is exactly like C expression parsing semantics such as precidence and symbols.

I'm using the Lemon parser. I ran into an issue of the same token being used for two different things, and I can't tell the difference in the lexer. The ampersand (&) symbol is used for both 'bitwise and' and "address of".

At first I thought it was a trivial issue, until I realized that they don't have the same associativity.

How do I give the same token two different associativities? Should I just use AMP (as in ampersand) and make the addressof and bitwise and rules use AMP, or should I use different tokens (such as ADDRESSOF and BITWISE_AND). If I do use separate symbols, how am I supposed to know which one from the lexer (it can't know without being a parser itself!).


Solution

  • If you're going to write the rules out explicitly, using a different non-terminal for every "precedence" level, then you do not need to declare precedence at all, and you should not do so.

    Lemon, like all yacc-derivatives, uses precedence declarations to remove ambiguities from ambiguous grammars. The particular ambiguous grammar referred to is this one:

    expression: expression '+' expression
              | expression '*' expression
              | '&' expression
              | ... etc, etc.
    

    In that case, every alternative leads to a shift-reduce conflict. If your parser generator didn't have precedence rules, or you wanted to be precise, you'd have to write that as an unambiguous grammar (which is what you've done):

    term: ID | NUMBER | '(' expression ')' ;
    postfix_expr:        term | term '[' expression '] | ... ;
    unary_expr:          postfix_expr | '&' unary_expr | '*' unary_expr | ... ;
    multiplicative_expr: unary_expr | multiplicative_expr '*' postfix_expr | ... ;
    additive_expr:       multiplicative_expr | additive_expr '+' multiplicative_expr | ... ;
    ...
    assignment_expr:     conditional_expr | unary_expr '=' assignment_expr | ...; 
    expression:          assignment_expr ;
    [1]
    

    Note that the unambiguous grammar even shows left-associative (multiplicative and additive, above), and right-associative (assignment, although it's a bit weird, see below). So there are really no ambiguities.

    Now, the precedence declarations (%left, %right, etc.) are only used to disambiguate. If there are no ambiguities, the declarations are ignored. The parser generator does not even check that they reflect the grammar. (In fact, many grammars cannot be expressed as this kind of precedence relationship.)

    Consequently, it's a really bad idea to include precedence declarations if the grammar is unambiguous. They might be completely wrong, and mislead anyone who reads the grammar. Changing them will not affect the way the language is parsed, which might mislead anyone who wants to edit the grammar.

    There is at least some question about whether it's better to use an ambiguous grammar with precedence rules or to use an unambiguous grammar. In the case of C-like languages, whose grammar cannot be expressed with a simple precedence list, it's probably better to just use the unambiguous grammar. However, unambiguous grammars have a lot more states and may make parsing slightly slower, unless the parser generator is able to optimize away the unit-reductions (all of the first alternatives in the above grammar, where each expression-type might just be the previous expression-type without affecting the AST; each of these productions needs to be reduced, although it's mostly a no-op, and many parser generators will insert some code.)

    The reason C cannot simply be expressed as a precedence relationship is precisely the assignment operator. Consider:

    a = 4 + b = c + 4;
    

    This doesn't parse because in assignment-expression, the assignment operator can only apply on the left to a unary-expression. This doesn't reflect either possible numeric precedence between + and =. [2]

    If + were of higher precedence than =, the expression would parse as:

    a = ((4 + b) = (c + 4));
    

    and if + were lower precedence, it would parse as

    (a = 4) + (b = (c + 4));
    

    [1] I just realized that I left out cast_expression but I can't be cast to put it back in; you get the idea)

    [2] Description fixed.