The input content:
The grammar:
grammar test;
p : EOF;
Char : [a-z];
fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip;
T2 : '#' T1+ Char+;
The matching result is this:
[@0,0:6='# abc',<T2>,1:0] <<<<<<<< PLACE 1
[@1,7:6='<EOF>',<EOF>,1:7]
line 1:0 extraneous input '# abc' expecting <EOF>
Please ignore the error in the last line. I am wondering why the token matched at PLACE 1 is T2
.
In the grammar file, the T2
lexer rule goes after the T1
lexer rule. So I expect T1
rule should get applied first. So why the spaces in # abc
is not skipped?
Does ANTLR uses some greedy strategy to match current character stream with the longest lexer rule?
Three rules apply, in this order:
#
in your grammar) next.After much wee-hours searching, I found again most of this material in one lengthy quote from Sam Harwell in which he also expounds on the impact of greedy operators. I remember seeing it the first time and sketching the notes in my copy of TDAR, but without the reference.
ANTLR 4 lexers normally operate with longest-match-wins behavior, without any regard for the order in which alternatives appear in the grammar. If two lexer rules match the same longest input sequence, only then is the relative order of those rules compared to determine how the token type is assigned.
The behavior within a rule changes as soon as the lexer reaches a non-greedy optional or closure. From that moment forward to the end of the rule, all alternatives within that rule will be treated as ordered, and the path with the lowest alternative wins. This seemingly strange behavior is actually responsible for the non-greedy handling due to the way we order alternatives in the underlying ATN representation. When the lexer is in this mode and reaches the block (ESC|.), the ordering constraint requires it use ESC if possible.
The "implicit token" rule is from here.