ANTLR4: Whitespace handling


I have seen many ANTLR grammars that use whitespace handling like this:

WS: [ \n\t\r]+ -> skip;
// or
WS: [ \n\t\r]+ -> channel(HIDDEN);

So whitespace is either discarded or sent to the hidden channel.

With a grammar like this:

grammar Not;

start:      expression;
expression: NOT expression
          | (TRUE | FALSE);

NOT:    'not';
TRUE:   'true';
FALSE:  'false';
WS: [ \n\t\r]+ -> skip;

valid inputs are 'not true' and 'not false', but 'nottrue' is also accepted, which is not the desired result. Changing the grammar to:

grammar Not;

start:      expression;

expression: NOT WS+ expression
          | (TRUE | FALSE);

NOT:    'not';

TRUE:   'true';
FALSE:  'false';

WS: [ \n\t\r];

fixes the problem, but I do not want to handle whitespace manually in each rule.

In general, I want whitespace to be required between tokens, with some exceptions (e.g. '!true' does not need whitespace in between).

Is there a simple way of doing this?


Solution

  • Add an IDENTIFIER lexer rule to handle words which are not keywords.

    IDENTIFIER : [a-zA-Z]+;
    

    Now the text nottrue is a single IDENTIFIER token, because the lexer always prefers the longest match. The parser will not accept that token in place of the two distinct keyword tokens produced by not true.

    Make sure IDENTIFIER is defined after your keyword rules. For the text not, both NOT and IDENTIFIER match the same number of characters, and the lexer breaks the tie by assigning the token type of the rule that appears first in the grammar.
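
    As a rough sketch of how this could look for the grammar in the question, whitespace can go back to being skipped once IDENTIFIER is in place. The BANG rule and the trailing EOF are illustrative additions here, not part of the original grammar:

    grammar Not;

    // EOF (an addition) makes the parser consume the whole input
    start:      expression EOF;

    expression: NOT expression
              | BANG expression      // BANG is a hypothetical rule for the '!true' case
              | (TRUE | FALSE);

    // Keywords first, so they win the tie against IDENTIFIER
    NOT:    'not';
    TRUE:   'true';
    FALSE:  'false';
    BANG:   '!';

    // Catch-all for words; must come after the keyword rules
    IDENTIFIER : [a-zA-Z]+;

    WS: [ \n\t\r]+ -> skip;

    With this sketch, not true should lex as NOT TRUE, nottrue should lex as one IDENTIFIER and be rejected by the parser, and !true should lex as BANG TRUE without any whitespace, since '!' cannot be part of an IDENTIFIER.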