Search code examples
parsingantlrantlr4ambiguity

Why is this ANTLR4 grammar ambiguous?


I am struggling to understand the ANTLR4 algorithm and how it handles left-recursion. Hoping someone can educate my a bit.

Take the below left recursive grammar:

grammar Dummy;

TOK1 : 'foo';
TOKE_OPT : 'bar';
TOK2 : 'baz';
TOKDERP : 'derp';

SPACES
 : [ \u000B\t\r\n] -> channel(HIDDEN)
 ;

rr
    : rr TOK1 rr TOKE_OPT?
    | '(' TOK2 ')'
    | TOKDERP
    ;

And the following input string:

derp foo derp foo  derp

When run through TestRig -diagnostics ANTLR concludes the grammar to be ambiguous and I don't understand why:

line 1:5 reportAttemptingFullContext d=2 (rr), input='foo'
line 1:9 reportContextSensitivity d=2 (rr), input='foo derp'
line 1:14 reportAttemptingFullContext d=2 (rr), input='foo'
line 2:0 reportAmbiguity d=2 (rr): ambigAlts={1, 2}, input='foo  derp
'

It would be greatly appreciated if someone can explain why this grammar is ambiguous and how one would get rid of the ambiguity. It's also possible that I don't understand why ambiguity means :).

If I remove the TOKE_OPT? clause the warnings go away.

I am using ANTLR version 4.7.2


Solution

  • That grammar really is ambiguous, because the grammar allows two interpretations of derp foo derp foo derp:

    (rr (rr (rr derp) foo (rr derp)) foo (rr derp))
    (rr (rr derp) foo (rr (rr derp) foo (rr derp)))
    

    (Personally, I think this whole question would be easier to read if instead of abstracting away from expressions, you'd just used plausible operator and operand tokens. But I digress.)

    Antlr4, which is a type of LL parser, cannot really handle left-recursion. It works around that by translating left-recursive rules into a simple equivalent form, effectively altering:

    rule: base
        | rule more
        ;
    

    into

    rule: base (more)* ;
    

    But that's not really sufficient to handle the typical case of left-recursive rules, which is algebraic expressions. Here, a typical grammar might be:

    expr: expr '*' expr
        | expr '+' expr
        | atom
        ;
    

    And the intention is something like:

    expr: atom ('*' atom)* ('+' ('*' atom)*)*
    

    But that's a complicated transformation and doesn't generalise well, so what Antlr really does is introduce predicates into each rule which enforce operator precedence ordering. With these predicates, the grammar becomes unambiguous, and also (usually) conforms to expectations about how expression grammars should be parsed.

    However, Antlr can only get the precedence predicates right if there is no "hidden" left- or right-recursion. ("Hidden right-recursion" doesn't mean that the recursion is hidden. What's hidden is the fact that the recursion occurs at the end of the rule.) In particular, putting an optional token at the end of a rule hides the fact that the non-terminal preceding the optional token could be right-recursive, and thus Antlr4 does not attempt to disambiguate the rule with a precedence predicate. And that leaves the grammar ambiguous.

    You can work around this by avoiding hidden right-recursion:

    rr
        : rr  TOK1 rr TOKE_OPT
        | rr  TOK1 rr
        | '(' TOK2 ')'
        | TOKDERP
        ;
    

    Now, the right-recursive rule is explicit, and the other rule (which ends with TOKE_OPT) is not ambiguous. (Or at least not ambiguous in the same way.)

    For a more precise description of the algorithm Antlr4 uses to rewrite rules, see the Appendix at the end of this technical report.