antlr antlr4

Purpose of ANTLR lexer "tokens" section


The docs for lexer rules show the following example for the lexer command type(...):

lexer grammar SetType;
tokens { STRING }
DOUBLE : '"' .*? '"'   -> type(STRING) ;
SINGLE : '\'' .*? '\'' -> type(STRING) ;
WS     : [ \r\t\n]+    -> skip ;

I can't find any documentation for tokens { STRING }.
What is it about? Why should I use it?


Solution

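  • The tokens section defines so-called virtual tokens. They are virtual in the sense that there's no lexer rule which represents that token. You may remember that token names are normally derived from the lexer rule that defines them. In your example, STRING is such a token: no rule is named STRING, but both DOUBLE and SINGLE emit it via type(STRING).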

    Sometimes, however, you need more differentiation. For example, you have a lexer rule for numbers, but you want to distinguish between SHORT, LONG, WORD etc. You can then define virtual tokens for those special values:

    tokens { SHORT, LONG, WORD }
    

    and ANTLR will define those as token types. Then you can write a number rule that returns one of those virtual types. Like:

    NUMBER: DIGITS { this._type = this.determineNumericType(this.text); };
    

    Note: the action above is written for a JS/TS target. You have to adjust it to your target language.

    By default the lexer would assign the token type NUMBER to the matched digits, but with the action you can assign any token type you want.
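
The determineNumericType helper used in the NUMBER rule is not shown in the answer, and it is presumably something you write yourself rather than part of the ANTLR runtime. A minimal sketch of what it might look like follows. The token type numbers and the value ranges (SHORT up to 32767, WORD up to 65535, LONG beyond that) are assumptions made purely for illustration; in a real project the constants would come from the generated lexer.

```typescript
// Hypothetical token type numbers. In a real project these are generated by
// ANTLR from the grammar (lexer rules plus the tokens { ... } section).
const TokenType = {
  NUMBER: 1,
  SHORT: 2,
  LONG: 3,
  WORD: 4,
} as const;

// Hypothetical helper: picks a virtual token type for the matched text.
// The ranges below are assumptions, not something ANTLR prescribes.
function determineNumericType(text: string): number {
  // Anything that is not a plain run of digits keeps the rule's default type.
  if (!/^\d+$/.test(text)) {
    return TokenType.NUMBER;
  }

  // Very long digit strings lose precision (or become Infinity) here,
  // but they still compare as larger than 65535 and end up as LONG.
  const value = Number(text);

  if (value <= 32767) {
    return TokenType.SHORT;
  }
  if (value <= 65535) {
    return TokenType.WORD;
  }
  return TokenType.LONG;
}
```

In the grammar, such a helper would typically live in a lexer base class or a members block so that the action in the NUMBER rule can call it with the matched text.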