antlr antlr3 antlrworks

ANTLR lexer rule consumes characters even if not matched?


I've run into a strange side effect of an ANTLR lexer rule and have created an (almost) minimal working example to demonstrate it. In this example I want to match a string such as [0..1]. But when I debug the grammar, the token stream that reaches the parser contains only [..1]. The first integer, no matter how many digits it has, is always consumed, and I have no clue how that happens. If I remove the FLOAT rule everything is fine, so I guess the mistake lies somewhere in that rule. But since it shouldn't match anything in [0..1] at all, I'm quite puzzled.

I'd be grateful for any pointers to where I might have gone wrong. This is my example:

grammar min;
options{
language = Java;
output = AST;
ASTLabelType=CommonTree;
backtrack = true;
}
tokens {
  DECLARATION;
}

declaration : LBRACEVAR a=INTEGER DDOTS b=INTEGER RBRACEVAR -> ^(DECLARATION $a $b);

EXP : 'e' | 'E';
LBRACEVAR: '[';
RBRACEVAR: ']';
DOT: '.';
DDOTS: '..';

FLOAT
    : INTEGER DOT POS_INTEGER
    | INTEGER DOT POS_INTEGER EXP INTEGER
    | INTEGER EXP INTEGER
    ;

INTEGER : POS_INTEGER | NEG_INTEGER;
fragment NEG_INTEGER : ('-') POS_INTEGER;
fragment POS_INTEGER : NUMBER+;
fragment NUMBER: ('0'..'9');

Solution

  • The '0' is discarded by the lexer and the following errors are produced:

    line 1:3 no viable alternative at character '.'
    line 1:2 extraneous input '..' expecting INTEGER
    

    This is because when the lexer encounters '0.', it tries to create a FLOAT token, but can't: a FLOAT needs a digit after the '.', and the next character is another '.'. Since there is no other rule to fall back on that matches '0.', it reports the errors, discards the '0' and creates a DOT token.

    This is simply how ANTLR's lexer works: it will not backtrack to match an INTEGER followed by a DDOTS (note that backtrack=true only applies to parser rules!).

    Inside the FLOAT rule, you must make sure that an INTEGER token is produced instead when a double '.' lies ahead. You can do that by adding a syntactic predicate (the ('..')=> part) that changes the token type, and by producing FLOAT tokens only when a single '.' is followed by a digit (the ('.' DIGIT)=> part). See the following demo:

    declaration
     : LBRACEVAR INTEGER DDOTS INTEGER RBRACEVAR
     ;
    
    LBRACEVAR : '[';
    RBRACEVAR : ']';
    DOT       : '.';
    DDOTS     : '..';
    
    INTEGER
     : DIGIT+
     ;
    
    FLOAT
     : DIGIT+ ( ('.' DIGIT)=> '.' DIGIT+ EXP? 
              | ('..')=>      {$type=INTEGER;} // change the token here
              |               EXP
              )
     ;
    
    fragment EXP   : ('e' | 'E') DIGIT+;
    fragment DIGIT : ('0'..'9');
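
The lookahead decision that the two predicates encode can also be sketched in plain Java. This is a hand-written illustration only, not the code ANTLR generates: the class name RangeLexerSketch and its helper methods are hypothetical, and tokens are simply rendered as TYPE(text) strings. After reading a run of digits it peeks ahead: a '.' followed by a digit commits to FLOAT, an exponent commits to FLOAT, and anything else (including '..') leaves the digits as an INTEGER.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of the predicate logic; NOT ANTLR-generated code.
public class RangeLexerSketch {

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the index of the first non-digit at or after 'from'.
    private static int skipDigits(String s, int from) {
        int i = from;
        while (i < s.length() && isDigit(s.charAt(i))) {
            i++;
        }
        return i;
    }

    // Consumes ('e' | 'E') DIGIT+ if present; otherwise returns 'from' unchanged.
    private static int tryExponent(String s, int from) {
        if (from + 1 < s.length()
                && (s.charAt(from) == 'e' || s.charAt(from) == 'E')
                && isDigit(s.charAt(from + 1))) {
            return skipDigits(s, from + 1);
        }
        return from;
    }

    // Hypothetical helper: tokenizes the input into "TYPE(text)" strings.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '[') {
                tokens.add("LBRACEVAR([)");
                i++;
            } else if (c == ']') {
                tokens.add("RBRACEVAR(])");
                i++;
            } else if (c == '.') {
                if (i + 1 < input.length() && input.charAt(i + 1) == '.') {
                    tokens.add("DDOTS(..)");
                    i += 2;
                } else {
                    tokens.add("DOT(.)");
                    i++;
                }
            } else if (isDigit(c)) {
                int start = i;
                i = skipDigits(input, i);
                String type = "INTEGER";
                // Mirrors ('.' DIGIT)=> : commit to FLOAT only if a digit follows the '.'.
                if (i + 1 < input.length() && input.charAt(i) == '.'
                        && isDigit(input.charAt(i + 1))) {
                    i = skipDigits(input, i + 1);
                    i = tryExponent(input, i); // optional exponent
                    type = "FLOAT";
                } else {
                    // Mirrors the EXP alternative; '..' falls through and stays INTEGER.
                    int afterExp = tryExponent(input, i);
                    if (afterExp != i) {
                        i = afterExp;
                        type = "FLOAT";
                    }
                }
                tokens.add(type + "(" + input.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [LBRACEVAR([), INTEGER(0), DDOTS(..), INTEGER(1), RBRACEVAR(])]
        System.out.println(tokenize("[0..1]"));
    }
}
```

The point of the sketch is that the decision between INTEGER and FLOAT is made by looking ahead before consuming the '.', which is what the syntactic predicates ask the generated lexer to do.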