Search code examples
compiler-constructionantlr4yaccll-grammarlalr

Is there any example of language grammar that possible for Yacc to express but impossible for Antlr4?


I try to learn about language parser recently and always seen the review about difference in Yacc and Antlr (about LALR and LL). It was always some concluded wording like "LALR is more powerful". But I can't understand what its really means

So could anyone please enlighten me what is the meaning of the word powerful here?

I just assume that it would be meant "Yacc can do something Antlr can't do", if it is I wish I could see the exact example about it


Solution

  • A language that is LR(1) but not LL(*)

    The question Language theoretic comparison of LL and LR grammars has an answer with the following language that is LR(1) but not LL(*):

    { a^i b^j | i≥j }
    

    That is, some number of a followed by an equal or lesser number of b.

    The same language is cited by this answer to the similar question Example of an LR grammar that cannot be represented by LL?. However, the present question is different because that one says "LL", meaning LL(k), whereas here we're asking about LL(*) (and Antlr4).

    Intuitive demonstration (not a proof)

    Let's intuitively argue that this is LR(1) but not LL(*).

    First, the LR(1) grammar (copied from the second linked answer):

    S ::= a S | P
    P ::= a P b | <empty>
    

    Intuitively, this is LR(1) because an LR(1) parser can push any number of a symbols onto its stack, then when it gets to the first b, begin popping the corresponding a symbols in a,b pairs, using the first production for P. If it runs out of b symbols, it pops the remaining a symbols using the first production for S. If it runs out of a symbols while there are still b symbols left, it signals an error. (Remember, in this context, we're primarily concerned with recognition, so the output is either "yes" or "error".)

    In contrast, this grammar is not LL(*). Intuitively, an LL(*) parser must decide, when it sees the first a, whether to use the first or second production of S. It would like to look ahead to see if there are as many b symbols as a symbols remaining, because if not, then it would know it has to use the first production to "burn" the excess a symbols. But LL(*) lookahead is limited to recognizing a regular language, and a regular language cannot recognize { a^i b^i } since it cannot "count".

    Of course, the fact that one grammar is not LL(*) does not imply that the language is not LL(*), because there might be a more clever grammar. To prove it is not LL(*), I would probably start with the formal definition, assume I had a grammar with those conditions, and then use a pumping lemma argument to show it can't correctly recognize the language of interest. But I'll let the linked resources suffice as rigorous justification that the language is not LL(*).

    Higher level interpretation

    The way I think about it, LL makes decisions on the way "down" the parse tree, while LR makes them on the way "up". To make a language that is not LL(k), we arrange it so a putative parser would have to commit to an interpretation of a symbol when the information it needs to do so is beyond the horizon of k symbols. To make it not LL(*), we need to put the crucial information beyond a horizon that can only be crossed by first recognizing a non-regular language.

    In contrast, LR can push symbols onto its stack, delaying their interpretation until it has seen the end of the associated production and already constructed interpretations of everything in between.

    To try to make this a bit more concrete, imagine a programming language that has two kinds of brace-enclosed things, say, code blocks and object literals (like Javascript). Imagine they both can occur in the same context (unlike Javascript):

      var x = { console.log("I am a code block"); /*result is*/ 6; };
      var x = { a:1, b:2 };
    

    In that context, a parser encounters {. LL must decide immediately whether that's the start of a code block or an object literal. In Javascript, object literal keys have to be identifiers or string literals, and the union of both of those is a regular language, so an LL(*) parser can skip past the regex for "identifier or stringlit" to check for :, which would signal object literal (otherwise code block).

      {                    // hmmm, code or object?
      { a                  // possible object literal key
      { a :                // a-ha! definitely object literal
    

    If instead a key could be an arbitrary string-typed expression, then LL(*) is in trouble because it has to balance parentheses to get past the putative key so it can check for the ::

      {                    // start of object literal?
      { (                  // uh-oh ...
      { (a                 // I'm
      { (a ?               //     getting
      { (a ? b             //             lost
      { (a ? b :           // is this the ':' after a key? help!
    

    In contrast, LR happily defers the interpretation of {, pushing it onto a stack, and proceeds, in effect, with two potential interpretations until some token disambiguates them.

    Hopefully this provides some intuition for what sorts of things LR contains but LL(*) does not.

    There are examples of the reverse (LL(*) but not LR), although I don't know what they look like ("not LR" is a hard class to think about); see the first linked question for details.

    Antlr4 semantic predicates

    Now, the question title actually asks about Antlr4. Antlr4 has semantic predicates, effectively allowing the programmer to insert arbitrary lookahead computation. So if you're willing to step outside the grammar formalism, in fact there is no limit (short of decidability) on what an Anltr4 parser can recognize.