
ANTLR conditional Lexer


I have the following ANTLR grammar:

relation
  : IDENTIFIER EQUAL relative_date
; 
relative_date
 : K_NOW (PLUS | MINUS) NUMERIC_LITERAL TIME_UNIT
;

IDENTIFIER 
 : //'"' (~'"' | '""')* '"'
 '`' (~'`' | '``')* '`'
 | '[' ~']'* ']'
 | [a-zA-Z_] [a-zA-Z_.0-9]* 
;

TIME_UNIT
 : ('h'|'m'|'s'|'d'|'w'|'M'|'y'|'q')
;

PLUS : '+';
MINUS : '-';
EQUAL: '=';
K_NOW : N O W;
NUMERIC_LITERAL
 : [0-9]+ ;

If I put TIME_UNIT before IDENTIFIER, the parser behaves like this:

  • something = now - 5d works
  • d = now - 5d does NOT work: it fails at the first d and reports that an IDENTIFIER is required

If I put TIME_UNIT after IDENTIFIER, the parser behaves like this:

  • something = now - 5d fails at the second d and reports that a TIME_UNIT is required
  • d = now - 5d fails at the second d and reports that a TIME_UNIT is required

Can someone help me change the grammar so that it works in both cases? For example, it should use the TIME_UNIT lexer rule inside a relative date and the IDENTIFIER lexer rule everywhere else.


Solution

  • ANTLR's lexer tries to match as many characters as possible. When two or more lexer rules match the same number of characters, the rule defined first "wins".

    So, the input d matches both TIME_UNIT and IDENTIFIER, but because IDENTIFIER is defined first, it wins. In other words: the rule TIME_UNIT will never be matched.
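
The two rules above (longest match first, then definition order as the tie-breaker) can be imitated in a few lines of Python. This is only a sketch of the behaviour, not ANTLR itself: the function name first_token and the regular expressions are stand-ins I made up for the TIME_UNIT and IDENTIFIER rules from the question.

```python
import re

def first_token(text, rules):
    """Return (rule_name, lexeme) for the token at the start of `text`.

    Imitates the behaviour described above: the longest match wins, and
    on a tie the rule listed first wins. Returns None if nothing matches.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best, so an
        # equal-length match never displaces a rule that was listed earlier.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Simplified stand-ins for two lexer rules from the question (assumption:
# only the plain-name alternative of IDENTIFIER is modelled here).
TIME_UNIT = ("TIME_UNIT", r"[hmsdwMyq]")
IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_.0-9]*")

print(first_token("d", [IDENTIFIER, TIME_UNIT]))          # ('IDENTIFIER', 'd')
print(first_token("d", [TIME_UNIT, IDENTIFIER]))          # ('TIME_UNIT', 'd')
print(first_token("something", [TIME_UNIT, IDENTIFIER]))  # ('IDENTIFIER', 'something')
```

The last call shows why something still works with TIME_UNIT listed first: IDENTIFIER matches nine characters while TIME_UNIT matches only one, so the order of the rules is never consulted.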

    The solution is to put TIME_UNIT before IDENTIFIER:

    TIME_UNIT
     : ('h'|'m'|'s'|'d'|'w'|'M'|'y'|'q')
     ;
    
    K_NOW
     : N O W
     ;
    
    IDENTIFIER 
     : //'"' (~'"' | '""')* '"'
       '`' (~'`' | '``')* '`'
     | '[' ~']'* ']'
     | [a-zA-Z_] [a-zA-Z_.0-9]* 
     ;
    

    (Note that K_NOW will also need to be placed before IDENTIFIER!)

    However, the inputs d, h, m, etc. will now never become an IDENTIFIER, because they will always be tokenized as a TIME_UNIT. You cannot change this; that is how ANTLR's lexer works. You can handle it in the parser instead, like this:

    identifier
     : IDENTIFIER
     | TIME_UNIT
     ;
    
    TIME_UNIT
     : ('h'|'m'|'s'|'d'|'w'|'M'|'y'|'q')
     ;
    
    IDENTIFIER 
     : //'"' (~'"' | '""')* '"'
       '`' (~'`' | '``')* '`'
     | '[' ~']'* ']'
     | [a-zA-Z_] [a-zA-Z_.0-9]* 
     ;
    

    and then use the rule identifier in your parser rules instead of IDENTIFIER:

    relation
     : identifier EQUAL relative_date
     ;
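
To see why the identifier rule fixes both inputs, here is a rough Python sketch of what the generated parser would accept. It is hand-written and hypothetical (the helpers expect and parse_relation are my own names, and the token lists are written out by hand on the assumption that TIME_UNIT and K_NOW are defined before IDENTIFIER), so treat it as an illustration rather than ANTLR's actual output.

```python
def expect(tokens, pos, *types):
    """Consume one token whose type is in `types`; return the next position."""
    if pos >= len(tokens) or tokens[pos][0] not in types:
        found = tokens[pos][0] if pos < len(tokens) else "EOF"
        raise SyntaxError(
            f"expected {' or '.join(types)} at token {pos}, found {found}"
        )
    return pos + 1

def parse_relation(tokens):
    """relation : identifier EQUAL relative_date ;"""
    # identifier : IDENTIFIER | TIME_UNIT ;
    pos = expect(tokens, 0, "IDENTIFIER", "TIME_UNIT")
    pos = expect(tokens, pos, "EQUAL")
    # relative_date : K_NOW (PLUS | MINUS) NUMERIC_LITERAL TIME_UNIT ;
    pos = expect(tokens, pos, "K_NOW")
    pos = expect(tokens, pos, "PLUS", "MINUS")
    pos = expect(tokens, pos, "NUMERIC_LITERAL")
    pos = expect(tokens, pos, "TIME_UNIT")
    return pos == len(tokens)

# Hand-written token streams for the two inputs from the question.
tail = [("EQUAL", "="), ("K_NOW", "now"), ("MINUS", "-"),
        ("NUMERIC_LITERAL", "5"), ("TIME_UNIT", "d")]
something_tokens = [("IDENTIFIER", "something")] + tail  # something = now - 5d
d_tokens = [("TIME_UNIT", "d")] + tail                   # d = now - 5d

print(parse_relation(something_tokens))  # True
print(parse_relation(d_tokens))          # True
```

The lexer still emits TIME_UNIT for the leading d; the parser simply accepts that token type wherever an identifier is allowed.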