swift antlr antlr4 lexical-analysis language-recognition

How ANTLR decides whether terminals should be separated with whitespaces or not?

I'm writing lexical analyzer in Swift for Swift. I used ANTLR's grammar, but I faced with problem that I don't understand how ANTLR decides whether terminals should be separated with whitespaces.

Here's the grammar: https://github.com/antlr/grammars-v4/blob/master/swift/Swift.g4

Assume we have casting in Swift. It can also operate with optional types (Int?, String?) and with non-optional types (Int, String). Here are valid examples: "as? Int", "as Int", "as?Int". Invalid examples: "asInt" (it isn't a cast). I've implemented logic, when terminals in grammar rules can be separated with 0 or more WS (whitespace) symbols. But with this logic "asInt" is matching a cast, because it contains "as" and a type "Int" and have 0 or more WS symbols. But it should be invalid.

Swift grammar contains these rules:

DOT     : '.' ;
LCURLY  : '{' ;
LPAREN  : '(' ;
LBRACK  : '[' ;
RCURLY  : '}' ;
RPAREN  : ')' ;
RBRACK  : ']' ;
COMMA   : ',' ;
COLON   : ':' ;
SEMI    : ';' ;
LT      : '<' ;
GT      : '>' ;
UNDERSCORE : '_' ;
BANG    : '!' ;
QUESTION: '?' ;
AT      : '@' ;
AND     : '&' ;
SUB     : '-' ;
EQUAL   : '=' ;
OR      : '|' ;
DIV     : '/' ;
ADD     : '+' ;
MUL     : '*' ;
MOD     : '%' ;
CARET   : '^' ;
TILDE   : '~' ;

It seems that all these terminals can be separated with other's with 0 WS symbols, and other terminals don't (e.g. "as" + Identifier).

Am I right? If I'm right, the problem is solved. But there may be more complex logic.

Now if I have rules

WS : [ \n\r\t\u000B\u000C\u0000]+
a : 'str1' b
b : 'str2' c
c : '+' d
d : 'str3'

I use them as if they were these rules:

WS : [ \n\r\t\u000B\u000C\u0000]+
a : WS? 'str1' WS? 'str2' WS? '+' WS? 'str3' WS?

And I suppose that they should be like these (I don't know and that is the question):

WS : [ \n\r\t\u000B\u000C\u0000]+
a: 'str1' WS 'str2' WS? '+' WS? 'str3'

(notice WS is not optional between 'str1' and 'str2')

So there's 2 questions:

Am I right?
What I missed?

Thanks.

Solution

Here's the ANTLR WS rule in your Swift grammar:

WS : [ \n\r\t\u000B\u000C\u0000]+               -> channel(HIDDEN) ;

The -> channel(HIDDEN) instruction tells the lexer to put these tokens on a separate channel, so the parser won't see them at all. You shouldn't litter your grammar with WS rules - it'd become unreadable.

ANTLR works in two steps: you have the lexer and the parser. The lexer produces the tokens, and the parser tries to figure out a concrete syntax tree from these tokens and the grammar.

The lexer in ANTLR works like this:

Consume characters as long as they match any lexer rule.
If several rules match the text you've consumed, use the first one which appears in the grammar
Literal strings in the grammar (like 'as') are turned into implicit lexer rules (equivalent to TOKEN_AS: 'as'; except the name will be just 'as'). These end up first in the lexer rules list.

Example 1

Let's see the consequences of these when lexing as?Int (with a space at the end):

a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
as? does not match any lexer rule

Therefore, you consume as, which will become a token. Now you have to decide which will be the token type. Both Identifier and 'as' rules match. 'as' is an implicit lexer rule, and considered to appear first in the grammar, therefore it takes precedence. The lexer emits a token with text as of type 'as'.

Next token.

?... potentially matches the QUESTION rule
?I doesn't match any rule

Therefore, you consume ? from the input and emit a token of type QUESTION with text ?.

Next token.

I... potentially matches Identifier
In... potentially matches Identifier
Int... potentially matches Identifier
Int (followed by a space) does not match anything

Therefore, you consume Int from the input and emit a token of type Identifier with text Int.

Next token.

You have a space there, it matches the WS rule.

You consume that space, and emit a WS token on the HIDDEN channel. The parser won't see this.

Example 2

Now let's see how asInt is tokenized.

a... potentially matches Identifier and 'as'
as... potentially matches Identifier and 'as'
asI... potentially matches Identifier
asIn... potentially matches Identifier
asInt... potentially matches Identifier
asInt followed by a space doesn't match any lexer rule.

Therefore, you consume asInt from the input stream, and emit an Identifier token with text asInt.

The parser

The parser stage is only interested in the token types it gets. It does not care about what text they contain. Tokens outside the default channel are ignored, which means the following inputs:

as?Int - tokens: 'as' QUESTION Identifier
as? Int - tokens: 'as' QUESTION WS Identifier
as ? Int - tokens: 'as' WS QUESTION WS Identifier

Will all result in the parser seeing the following token types: 'as' QUESTION Identifier, as WS is on a separate channel.