Two-level grammar with ANTLR 3

I have a grammar which (if you simplify it quite a bit) looks like this:

options
{
    backtrack=true;
}

// parser
text: (TEXT)+;

and_level2_thing: text | '(' and_thing ')';

and_thing: and_level2_thing (OP_AND and_level2_thing)*;

and_expression: and_thing (OP_AND and_thing)*;

parse_starts_here: and_expression EOF;

// lexer
OP_AND : 'AND';
TEXT : ( '"' (~'"')* '"' );

It has two types of groups of expressions top-level (and_thing) and inner-level (and_level2_thing) for which different rules apply but both levels have to support AND e.g. TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION and TOP_TYPE_EXPRESSION AND (INNER_TYPE_EXPRESSION AND INNER_TYPE_EXPRESSION).

When I have a value of the form:

(TOP_TYPE_EXPRESSION AND (TOP_TYPE_EXPRESSION AND (TOP_TYPE_EXPRESSION AND (TOP_TYPE_EXPRESSION))))

the time starts to become exponential on the nesting level presumably because the AND is ambiguous. This expression evaluates instantly:

TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION

If you say this isn't a well designed language - I perfectly agree, but that's what I have right now :). Any ideas how to avoid this problem?

Solution

Your grammar is ambiguous:

"a" AND "b"

can be matched as

parse_starts_here
  and_expression
    and_thing
      and_level2_thing
        text
      OP_AND
      and_level2_thing
        text

or as

parse_starts_here
  and_expression
    and_thing
      and_level2_thing
        text
    OP_AND
    and_thing
      and_level2_thing
        text

Normally, ANTLR would warn you about this ambiguity, but by declaring backtrack = true, you effectively tell ANTLR to try all alternatives and use first that matches.

On unambiguous grammars ANTLR runs in linear time. Use of backtracking leads to potentially exponential time. memoize=true is used to reduce the time back to linear at the cost of more memory used.

I would recommend removing the backtrack=true option. ANTLR will then tell you where the grammar is ambiguous. You can either remove the ambiguity, or if it is not possible, use syntactic predicates only where needed to prefer one possible match to the other. memoize=true would still help if you end up using syntactic predicates.

Edit - As to why there's backtracking even when both alternatives match:

It does not backtrack, but the time will still be exponential.

The problem is that ANTLR doesn't know it can match the first alternative until it actually tries to match it (since you didn't give it any hints). So it will first try to match the rule, and if it succeeds it will actually match it and perform all associated actions (memoize option avoids exactly this by remembering the particular rule succeeded for the given input position and not repeating the whole matching process).

Example:

"a" AND ( "b" AND "c" )

To match this, ANTLR must:

Match "a"
Decide whether the AND can be matched using the inner rule
- To do this, it tries to match the inner rule
- AND matches, ( means go to and_thing
- To match and_thing, it must:
  - Match ( and "b"
  - Decide whether the AND can be matched using the inner rule
    - To do this, it tries to match the inner rule against AND "c"
    - The predicate succeeds - AND "c" matches the inner rule
  - Match the inner rule against AND "c"
  - Match )
- The predicate succeeds - AND ( "b" AND "c" ) matches the inner rule
Match the inner rule against AND ( "b" AND "c" )
- AND matches, ( means go to and_thing
- To match and_thing, it must:
  - Match ( and "b"
  - Decide whether the AND can be matched using the inner rule
    - To do this, it tries to match the inner rule against AND "c"
    - The predicate succeeds - AND "c" matches the inner rule
  - Match the inner rule against AND "c"
  - Match )

AS the emphasized parts of the process show, ANTLR needs to match the text AND "c" four times to match the input, while there's one level of nesting. If there was another level, this whole process would repeat twice, so ANTLR would parse the last part eight times.

One related remark - if you use syntactic predicates rather than backtrack option, you can fine-tune what the predicate contains - in some cases it needs not contain the whole rule being predicated. In your example above, you could just tell ANLTR to use the OP_AND and_level2_thing rule whenever it encounters OP_AND, without needing to check whether and_level2_thing matches. Note that you can only do this because you know that either and_level2_thing will match, or the other alternative will fail as well. If you do this wrong you end up with the parser getting lost and refusing an input that would be valid if it chose the right alternative.