Search code examples
parsingoperator-keywordassociativity

Relation between grammar and operator associativity


Some compiler books / articles / papers talk about design of a grammar and the relation of its operator's associativity. I'm a big fan of top-down, especially recursive descent, parsers and so far most (if not all) compilers I've written use the following expression grammar:

Expr   ::= Term { ( "+" | "-" ) Term }
Term   ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"

which is an EBNF representation of this BNF:

Expr  ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term  ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor = INTEGER | "(" Expr ")"

According to what I read, some regards this grammar as being "wrong" due to the change of operator associativity (left to right for those 4 operators) proven by the growing parse tree to the right instead of left. For a parser implemented through attribute grammar, this might be true as l-attribute value requires that this value created first then passed to child nodes. however, when implementing with normal recursive descent parser, it's up to me whether to construct this node first then pass to child nodes (top-down) or let child nodes be created first then add the returned value as the children of this node (passed in this node's constructor) (bottom-up). There should be something I miss here because I don't agree with the statement saying this grammar is "wrong" and this grammar has been used in many languages esp. Wirthian ones. Usually (or all?) the reading that says it promotes LR parsing instead of LL.


Solution

  • I think the issue here is that a language has an abstract syntax which is just like:

    E ::= E + E | E - E | E * E | E / E | Int | (E)
    

    but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive decent parse, you're implicitly writing the concrete syntax into it as you go along and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!

    There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:

    Expr ::= Term + Term | Term - Term | Term
    Term ::= Factor * Factor | Factor / Factor | Factor
    Factor ::= INTEGER | (Expr)
    

    Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?

    Expr -> Term
         -> Factor * Factor
         -> 1 * Factor
         -> 1 * 2
    

    We can see the other grammar handles this in a slightly different way:

    Expr -> Term Expr'
         -> Factor Term' Expr'
         -> 1 Term' Expr'
         -> 1 * Factor Term' Expr'
         -> 1 * 2 Term' Expr'
         -> 1 * 2 ε Expr'
         -> 1 * 2 ε ε
          = 1 * 2
    

    but this achieves the same effect.

    Your parser is actually non-associative. To see this ask how E + E + E would be parsed and find that it couldn't. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term which is not possible. Equivalently, think about deriving that expression from the start symbol, again not possible.

    Expr -> Term + Term
         -> ? (can't get another + in here)
    

    The other grammar is left-associative ebcase an arbitrarily long sting of E + E + ... + E can be derived.

    So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!