compiler-construction antlr antlr4 grammar compiler-optimization

Possible to reduce lookahead on potentially ambiguous expressions

This is somewhat related to my previous question Ambiguity between tuple and parenthesized expression, but now about if there is a way to improve the number of lookaheads needed to resolve what type of expression something is.

In the following grammar, a tuple is differentiated between a parenthesized expression by whether it has a comma or not.

grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;

select:
    'SELECT' expr (',' expr)*
    ('FROM' expr) ?
    ('WHERE' expr) ?
    ;

expr
    : '(' expr ')'              # parenExpression
    | '(' expr (',' expr)+ ')'  # tupleLiteralExpression
    | expr 'IN' expr            # inExpression
    | '(' select ')'            # subSelectExpression
    | Atom                      # constantExpression
    ;

Atom:
    [a-z-]+ | [0-9]+ | '\'' Atom '\''
    ;

WHITESPACE: [ \t\r\n] -> skip;

Because the first expression may be quite complex, and we don't know whether it will resolve to a tuple or not until after evaluating the first expression, we potentially need a very large lookahead to resolve this -- in the following we need 52 tokens -- in other words, the sub-select expression 'needs to finish' before it can make the determination on what expression type it is.

Related to this I have a few questions:

Is a very large lookahead number always a bad thing?
Does a larger K value (more lookaheads) always imply a slower parse?
If it is, are there any ways to resolve it. Or it's more an "it's just what it is".

Solution

There are actually two problems here.

If you turn on debugging output for the ProfilingATNSimulator, you will see that the large lookahead is required at the very first '(' of the partial input string ((select 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 1). AdaptivePredict() knows that we are trying to find an expr, but we need to decide among the various alts for the rule which to take. We need to read to the very end of the entire expression--to the matching ')'--in order to make that decision. Why?

AdaptivePredict() needs to find an 'IN' (or not) after the entire expression ((select 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 1) to decide to take (or not) the 3rd alt | expr 'IN' expr.

Part of the solution here is to rewrite the 3rd alt as | Atom 'IN' expr, which restricts the use of 'IN' to be a single atom. This seems okay based on your one example input.

The other problem is the common prefix between the 1st and 2nd alts of expr. The solution is to merge the two alts together: | '(' expr (',' expr)* ')'.

The fixed grammar is:

grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;

select:
    'SELECT' expr (',' expr)*
    ('FROM' expr) ?
    ('WHERE' expr) ?
    ;

expr
    : '(' expr (',' expr)* ')'
    | Atom 'IN' expr          
    | '(' select ')'          
    | Atom                    
    ;

Atom:
    [a-z-]+ | [0-9]+ | '\'' Atom '\''
    ;

WHITESPACE: [ \t\r\n] -> skip;

The fixed grammar has a max k of 2.

Note, I really don't understand why people like alt labeling. It clutters the grammar and forces a renaming of visitors and listeners methods. For this grammar, any alt can be determined by testing a few choice children. I removed all alt labeling accordingly.