Tags: java, parsing, antlr4, identifier

ANTLR match identifier but not reserved keywords


I am trying to match complex numbers written in several notations. One of them uses the cis function, like this: MODULUS cis PHASE

The problem is that my identifier rule matches cis together with the start of the number that follows it. Since that match is longer than the CIS token itself, the lexer always returns an IDENTIFIER token. How can I avoid that?

Here's the grammar:

grammar Sandbox;

input : number? CIS UNSIGNED 
    | IDENTIFIER
    ;

number : FLOAT
    | UFLOAT 
    | UINT
    | INT
    ;

fragment DIGIT : [0-9] ;

UFLOAT : UINT (DOT UINT? | 'f') ;
FLOAT : SUB UFLOAT ;
UINT : DIGITS ;
INT : SUB UINT ;
UNSIGNED : UFLOAT 
    | UINT 
    ;
DIGITS : DIGIT+ ;

// Specific lexer rules
CIS : 'cis' ;
SUB : '-' ; 
DOT : '.' ;
WS : [ \t]+ -> skip ;
NEWLINE : '\r'? '\n' ;

IDENTIFIER : [a-zA-Z_]+[a-zA-Z0-9_]* ;  // has to be after complex so i or cis doesn't match this first

Edit: The input I was trying to parse is the complex number 1+i, written using its modulus and phase, like this: 1.4142135623730951cis0.7853981633974483

My actual problem is that the IDENTIFIER rule matches cis0 instead of the lexer matching just cis with the CIS rule, even though CIS is defined before IDENTIFIER.

I vaguely know that ANTLR chooses the rule that produces the longest match, but in this case I want to avoid that.
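
To make the longest-match behaviour concrete, here is a minimal Java sketch that imitates it with java.util.regex. It is not ANTLR's implementation: the class MaximalMunchDemo, the nextToken helper and the simplified rule table are all hypothetical. The only assumptions it encodes are that the longest match wins and that ties go to the rule declared first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: imitates longest-match token selection, not ANTLR itself.
class MaximalMunchDemo {
    // Simplified copies of the lexer rules, kept in declaration order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("UFLOAT", Pattern.compile("[0-9]+(\\.[0-9]*|f)"));
        RULES.put("UINT", Pattern.compile("[0-9]+"));
        RULES.put("CIS", Pattern.compile("cis"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-zA-Z_]+[a-zA-Z0-9_]*"));
    }

    // Returns "TYPE:text" for the token that starts at pos.
    static String nextToken(String input, int pos) {
        String bestType = null;
        int bestEnd = pos;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // Only a strictly longer match replaces the current best,
            // so on a tie the rule declared earlier keeps the token.
            if (m.lookingAt() && m.end() > bestEnd) {
                bestType = rule.getKey();
                bestEnd = m.end();
            }
        }
        if (bestType == null) {
            throw new IllegalArgumentException("no rule matches at position " + pos);
        }
        return bestType + ":" + input.substring(pos, bestEnd);
    }

    public static void main(String[] args) {
        String input = "1.4142135623730951cis0.7853981633974483";
        System.out.println(nextToken(input, 0));                    // UFLOAT:1.4142135623730951
        System.out.println(nextToken(input, input.indexOf("cis"))); // IDENTIFIER:cis0
        System.out.println(nextToken("cis 0.5", 0));                // CIS:cis
    }
}
```

In this sketch, declaring CIS before IDENTIFIER only helps when both rules match the same number of characters, as in "cis 0.5". In "cis0.78..." the identifier pattern consumes one more character, so it wins regardless of order.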


Solution

  • I see two solutions here:

    1. Make the complex number a single lexer rule:
    COMPLEX:  (FLOAT | UFLOAT | UINT | INT) WS* CIS WS* UNSIGNED;
    

    This match will be longer than an identifier or the bare CIS keyword, so the lexer prefers it.

    2. A cis sequence is a keyword when it follows a digit (with optional whitespace between them), right? So you could do a lookback (LA(-1)) in a predicate and reject cis as an identifier if that condition is true.

    I'd prefer solution 1, because the convention is that single entities (and a complex number is, like a floating-point number or a string, a single logical entity) are matched completely in a lexer rule, not in a parser rule.
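
    As a rough illustration of both ideas outside ANTLR, the Java sketch below mirrors the COMPLEX rule with a single regular expression and shows the lookback test as a plain string check. Everything in it is an assumption made for illustration: the class CisSolutionSketch, the parse and precededByDigit helpers, and the regex translation of the number rules. In a real grammar the same logic would live in the lexer rule or in a predicate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the two suggestions; not generated ANTLR code.
class CisSolutionSketch {
    // Solution 1: one pattern for the whole literal, mirroring
    // (FLOAT | UFLOAT | UINT | INT) WS* CIS WS* UNSIGNED.
    static final Pattern COMPLEX = Pattern.compile(
        "(?<modulus>-?[0-9]+(?:\\.[0-9]*|f)?)[ \\t]*cis[ \\t]*(?<phase>[0-9]+(?:\\.[0-9]*|f)?)");

    // Returns {modulus, phase}, or throws if text is not a single cis literal.
    static double[] parse(String text) {
        Matcher m = COMPLEX.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a cis literal: " + text);
        }
        // Double.parseDouble accepts forms such as "1." and "1f" as well.
        return new double[] {
            Double.parseDouble(m.group("modulus")),
            Double.parseDouble(m.group("phase"))
        };
    }

    // Solution 2: the lookback condition. True when the nearest character
    // before tokenStart, skipping spaces and tabs, is a digit.
    // Simplification: a modulus ending in '.' or 'f' is not recognised here.
    static boolean precededByDigit(String input, int tokenStart) {
        int i = tokenStart - 1;
        while (i >= 0 && (input.charAt(i) == ' ' || input.charAt(i) == '\t')) {
            i--;
        }
        return i >= 0 && Character.isDigit(input.charAt(i));
    }

    public static void main(String[] args) {
        double[] z = parse("1.4142135623730951cis0.7853981633974483");
        System.out.println(z[0] + " cis " + z[1]);
        System.out.println(precededByDigit("1.41  cis0.78", 6)); // true
        System.out.println(precededByDigit("cis0", 0));          // false
    }
}
```

    Note that a plain LA(-1) looks at exactly one character, so skipping whitespace the way precededByDigit does would need a small loop inside the predicate.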