Search code examples
swiftantlrantlr4lexical-analysislanguage-recognition

How ANTLR decides whether terminals should be separated with whitespaces or not?


I'm writing lexical analyzer in Swift for Swift. I used ANTLR's grammar, but I faced with problem that I don't understand how ANTLR decides whether terminals should be separated with whitespaces.

Here's the grammar: https://github.com/antlr/grammars-v4/blob/master/swift/Swift.g4

Assume we have casting in Swift. It can also operate with optional types (Int?, String?) and with non-optional types (Int, String). Here are valid examples: "as? Int", "as Int", "as?Int". Invalid examples: "asInt" (it isn't a cast). I've implemented logic, when terminals in grammar rules can be separated with 0 or more WS (whitespace) symbols. But with this logic "asInt" is matching a cast, because it contains "as" and a type "Int" and have 0 or more WS symbols. But it should be invalid.

Swift grammar contains these rules:

DOT     : '.' ;
LCURLY  : '{' ;
LPAREN  : '(' ;
LBRACK  : '[' ;
RCURLY  : '}' ;
RPAREN  : ')' ;
RBRACK  : ']' ;
COMMA   : ',' ;
COLON   : ':' ;
SEMI    : ';' ;
LT      : '<' ;
GT      : '>' ;
UNDERSCORE : '_' ;
BANG    : '!' ;
QUESTION: '?' ;
AT      : '@' ;
AND     : '&' ;
SUB     : '-' ;
EQUAL   : '=' ;
OR      : '|' ;
DIV     : '/' ;
ADD     : '+' ;
MUL     : '*' ;
MOD     : '%' ;
CARET   : '^' ;
TILDE   : '~' ;

It seems that all these terminals can be separated with other's with 0 WS symbols, and other terminals don't (e.g. "as" + Identifier).

Am I right? If I'm right, the problem is solved. But there may be more complex logic.

Now if I have rules

WS : [ \n\r\t\u000B\u000C\u0000]+
a : 'str1' b
b : 'str2' c
c : '+' d
d : 'str3'

I use them as if they were these rules:

WS : [ \n\r\t\u000B\u000C\u0000]+
a : WS? 'str1' WS? 'str2' WS? '+' WS? 'str3' WS?

And I suppose that they should be like these (I don't know and that is the question):

WS : [ \n\r\t\u000B\u000C\u0000]+
a: 'str1' WS 'str2' WS? '+' WS? 'str3'

(notice WS is not optional between 'str1' and 'str2')

So there's 2 questions:

  1. Am I right?
  2. What I missed?

Thanks.


Solution

  • Here's the ANTLR WS rule in your Swift grammar:

    WS : [ \n\r\t\u000B\u000C\u0000]+               -> channel(HIDDEN) ;
    

    The -> channel(HIDDEN) instruction tells the lexer to put these tokens on a separate channel, so the parser won't see them at all. You shouldn't litter your grammar with WS rules - it'd become unreadable.

    ANTLR works in two steps: you have the lexer and the parser. The lexer produces the tokens, and the parser tries to figure out a concrete syntax tree from these tokens and the grammar.

    The lexer in ANTLR works like this:

    • Consume characters as long as they match any lexer rule.
    • If several rules match the text you've consumed, use the first one which appears in the grammar
    • Literal strings in the grammar (like 'as') are turned into implicit lexer rules (equivalent to TOKEN_AS: 'as'; except the name will be just 'as'). These end up first in the lexer rules list.

    Example 1

    Let's see the consequences of these when lexing as?Int (with a space at the end):

    • a... potentially matches Identifier and 'as'
    • as... potentially matches Identifier and 'as'
    • as? does not match any lexer rule

    Therefore, you consume as, which will become a token. Now you have to decide which will be the token type. Both Identifier and 'as' rules match. 'as' is an implicit lexer rule, and considered to appear first in the grammar, therefore it takes precedence. The lexer emits a token with text as of type 'as'.

    Next token.

    • ?... potentially matches the QUESTION rule
    • ?I doesn't match any rule

    Therefore, you consume ? from the input and emit a token of type QUESTION with text ?.

    Next token.

    • I... potentially matches Identifier
    • In... potentially matches Identifier
    • Int... potentially matches Identifier
    • Int (followed by a space) does not match anything

    Therefore, you consume Int from the input and emit a token of type Identifier with text Int.

    Next token.

    • You have a space there, it matches the WS rule.

    You consume that space, and emit a WS token on the HIDDEN channel. The parser won't see this.

    Example 2

    Now let's see how asInt is tokenized.

    • a... potentially matches Identifier and 'as'
    • as... potentially matches Identifier and 'as'
    • asI... potentially matches Identifier
    • asIn... potentially matches Identifier
    • asInt... potentially matches Identifier
    • asInt followed by a space doesn't match any lexer rule.

    Therefore, you consume asInt from the input stream, and emit an Identifier token with text asInt.

    The parser

    The parser stage is only interested in the token types it gets. It does not care about what text they contain. Tokens outside the default channel are ignored, which means the following inputs:

    • as?Int - tokens: 'as' QUESTION Identifier
    • as? Int - tokens: 'as' QUESTION WS Identifier
    • as ? Int - tokens: 'as' WS QUESTION WS Identifier

    Will all result in the parser seeing the following token types: 'as' QUESTION Identifier, as WS is on a separate channel.