I have a grammar which (if you simplify it quite a bit) looks like this:
options
{
backtrack=true;
}
// parser
text: (TEXT)+;
and_level2_thing: text | '(' and_thing ')';
and_thing: and_level2_thing (OP_AND and_level2_thing)*;
and_expression: and_thing (OP_AND and_thing)*;
parse_starts_here: and_expression EOF;
// lexer
OP_AND : 'AND';
TEXT : ( '"' (~'"')* '"' );
It has two types of groups of expressions top-level (and_thing
) and inner-level (and_level2_thing
) for which different rules apply but both levels have to support AND e.g. TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION
and
TOP_TYPE_EXPRESSION AND (INNER_TYPE_EXPRESSION AND INNER_TYPE_EXPRESSION)
.
When I have a value of the form:
(TOP_TYPE_EXPRESSION AND (TOP_TYPE_EXPRESSION AND (TOP_TYPE_EXPRESSION AND (TOP_TYPE_EXPRESSION))))
the time starts to become exponential on the nesting level presumably because the AND is ambiguous. This expression evaluates instantly:
TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION AND TOP_TYPE_EXPRESSION
If you say this isn't a well designed language - I perfectly agree, but that's what I have right now :). Any ideas how to avoid this problem?
Your grammar is ambiguous:
"a" AND "b"
can be matched as
parse_starts_here
and_expression
and_thing
and_level2_thing
text
OP_AND
and_level2_thing
text
or as
parse_starts_here
and_expression
and_thing
and_level2_thing
text
OP_AND
and_thing
and_level2_thing
text
Normally, ANTLR would warn you about this ambiguity, but by declaring backtrack = true
, you effectively tell ANTLR to try all alternatives and use first that matches.
On unambiguous grammars ANTLR runs in linear time. Use of backtracking leads to potentially exponential time. memoize=true
is used to reduce the time back to linear at the cost of more memory used.
I would recommend removing the backtrack=true
option. ANTLR will then tell you where the grammar is ambiguous. You can either remove the ambiguity, or if it is not possible, use syntactic predicates only where needed to prefer one possible match to the other. memoize=true
would still help if you end up using syntactic predicates.
Edit - As to why there's backtracking even when both alternatives match:
It does not backtrack, but the time will still be exponential.
The problem is that ANTLR doesn't know it can match the first alternative until it actually tries to match it (since you didn't give it any hints). So it will first try to match the rule, and if it succeeds it will actually match it and perform all associated actions (memoize
option avoids exactly this by remembering the particular rule succeeded for the given input position and not repeating the whole matching process).
Example:
"a" AND ( "b" AND "c" )
To match this, ANTLR must:
"a"
AND
can be matched using the inner rule
AND
matches, (
means go to and_thing
and_thing
, it must:
(
and "b"
AND
can be matched using the inner rule
AND "c"
AND "c"
matches the inner ruleAND "c"
)
AND ( "b" AND "c" )
matches the inner ruleAND ( "b" AND "c" )
AND
matches, (
means go to and_thing
and_thing
, it must:
(
and "b"
AND
can be matched using the inner rule
AND "c"
AND "c"
matches the inner ruleAND "c"
)
AS the emphasized parts of the process show, ANTLR needs to match the text AND "c"
four times to match the input, while there's one level of nesting. If there was another level, this whole process would repeat twice, so ANTLR would parse the last part eight times.
One related remark - if you use syntactic predicates rather than backtrack option, you can fine-tune what the predicate contains - in some cases it needs not contain the whole rule being predicated. In your example above, you could just tell ANLTR to use the OP_AND and_level2_thing
rule whenever it encounters OP_AND
, without needing to check whether and_level2_thing
matches. Note that you can only do this because you know that either and_level2_thing
will match, or the other alternative will fail as well. If you do this wrong you end up with the parser getting lost and refusing an input that would be valid if it chose the right alternative.