Search code examples
compiler-constructionantlrantlr4lexer

How does ANTLR decide which lexer rule to apply? The longest matching lexer rule wins?


The input content:

enter image description here

The grammar:

grammar test;

p : EOF;

Char : [a-z];

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip;

T2 : '#' T1+ Char+;

The matching result is this:

[@0,0:6='#   abc',<T2>,1:0]    <<<<<<<< PLACE 1
[@1,7:6='<EOF>',<EOF>,1:7]
line 1:0 extraneous input '#   abc' expecting <EOF>

Please ignore the error in the last line. I am wondering why the token matched at PLACE 1 is T2.

In the grammar file, the T2 lexer rule goes after the T1 lexer rule. So I expect T1 rule should get applied first. So why the spaces in # abc is not skipped?

Does ANTLR uses some greedy strategy to match current character stream with the longest lexer rule?


Solution

  • Three rules apply, in this order:

    1. Longest match wins first.
    2. Rule matching implicit token (like # in your grammar) next.
    3. Finally, in case of a tie (by match length), the rule listed earliest among the matching rules wins.

    After much wee-hours searching, I found again most of this material in one lengthy quote from Sam Harwell in which he also expounds on the impact of greedy operators. I remember seeing it the first time and sketching the notes in my copy of TDAR, but without the reference.

    ANTLR 4 lexers normally operate with longest-match-wins behavior, without any regard for the order in which alternatives appear in the grammar. If two lexer rules match the same longest input sequence, only then is the relative order of those rules compared to determine how the token type is assigned.

    The behavior within a rule changes as soon as the lexer reaches a non-greedy optional or closure. From that moment forward to the end of the rule, all alternatives within that rule will be treated as ordered, and the path with the lowest alternative wins. This seemingly strange behavior is actually responsible for the non-greedy handling due to the way we order alternatives in the underlying ATN representation. When the lexer is in this mode and reaches the block (ESC|.), the ordering constraint requires it use ESC if possible.

    The "implicit token" rule is from here.