I'm writing lexical analyzer in Swift for Swift. I used ANTLR's grammar, but I faced with problem that I don't understand how ANTLR decides whether terminals should be separated with whitespaces.
Here's the grammar: https://github.com/antlr/grammars-v4/blob/master/swift/Swift.g4
Assume we have casting in Swift. It can also operate with optional types (Int?, String?) and with non-optional types (Int, String). Here are valid examples: "as? Int", "as Int", "as?Int". Invalid examples: "asInt" (it isn't a cast). I've implemented logic, when terminals in grammar rules can be separated with 0 or more WS (whitespace) symbols. But with this logic "asInt" is matching a cast, because it contains "as" and a type "Int" and have 0 or more WS symbols. But it should be invalid.
Swift grammar contains these rules:
DOT : '.' ;
LCURLY : '{' ;
LPAREN : '(' ;
LBRACK : '[' ;
RCURLY : '}' ;
RPAREN : ')' ;
RBRACK : ']' ;
COMMA : ',' ;
COLON : ':' ;
SEMI : ';' ;
LT : '<' ;
GT : '>' ;
UNDERSCORE : '_' ;
BANG : '!' ;
QUESTION: '?' ;
AT : '@' ;
AND : '&' ;
SUB : '-' ;
EQUAL : '=' ;
OR : '|' ;
DIV : '/' ;
ADD : '+' ;
MUL : '*' ;
MOD : '%' ;
CARET : '^' ;
TILDE : '~' ;
It seems that all these terminals can be separated with other's with 0 WS symbols, and other terminals don't (e.g. "as" + Identifier).
Am I right? If I'm right, the problem is solved. But there may be more complex logic.
Now if I have rules
WS : [ \n\r\t\u000B\u000C\u0000]+
a : 'str1' b
b : 'str2' c
c : '+' d
d : 'str3'
I use them as if they were these rules:
WS : [ \n\r\t\u000B\u000C\u0000]+
a : WS? 'str1' WS? 'str2' WS? '+' WS? 'str3' WS?
And I suppose that they should be like these (I don't know and that is the question):
WS : [ \n\r\t\u000B\u000C\u0000]+
a: 'str1' WS 'str2' WS? '+' WS? 'str3'
(notice WS is not optional between 'str1' and 'str2')
So there's 2 questions:
Thanks.
Here's the ANTLR WS
rule in your Swift grammar:
WS : [ \n\r\t\u000B\u000C\u0000]+ -> channel(HIDDEN) ;
The -> channel(HIDDEN)
instruction tells the lexer to put these tokens on a separate channel, so the parser won't see them at all. You shouldn't litter your grammar with WS
rules - it'd become unreadable.
ANTLR works in two steps: you have the lexer and the parser. The lexer produces the tokens, and the parser tries to figure out a concrete syntax tree from these tokens and the grammar.
The lexer in ANTLR works like this:
'as'
) are turned into implicit lexer rules (equivalent to TOKEN_AS: 'as';
except the name will be just 'as'
). These end up first in the lexer rules list.Let's see the consequences of these when lexing as?Int
(with a space at the end):
a
... potentially matches Identifier
and 'as'
as
... potentially matches Identifier
and 'as'
as?
does not match any lexer ruleTherefore, you consume as
, which will become a token. Now you have to decide which will be the token type. Both Identifier
and 'as'
rules match. 'as'
is an implicit lexer rule, and considered to appear first in the grammar, therefore it takes precedence. The lexer emits a token with text as
of type 'as'
.
Next token.
?
... potentially matches the QUESTION
rule?I
doesn't match any ruleTherefore, you consume ?
from the input and emit a token of type QUESTION
with text ?
.
Next token.
I
... potentially matches Identifier
In
... potentially matches Identifier
Int
... potentially matches Identifier
Int
(followed by a space) does not match anythingTherefore, you consume Int
from the input and emit a token of type Identifier
with text Int
.
Next token.
WS
rule.You consume that space, and emit a WS
token on the HIDDEN
channel. The parser won't see this.
Now let's see how asInt
is tokenized.
a
... potentially matches Identifier
and 'as'
as
... potentially matches Identifier
and 'as'
asI
... potentially matches Identifier
asIn
... potentially matches Identifier
asInt
... potentially matches Identifier
asInt
followed by a space doesn't match any lexer rule.Therefore, you consume asInt
from the input stream, and emit an Identifier
token with text asInt
.
The parser stage is only interested in the token types it gets. It does not care about what text they contain. Tokens outside the default channel are ignored, which means the following inputs:
as?Int
- tokens: 'as'
QUESTION
Identifier
as? Int
- tokens: 'as'
QUESTION
WS
Identifier
as ? Int
- tokens: 'as'
WS
QUESTION
WS
Identifier
Will all result in the parser seeing the following token types: 'as'
QUESTION
Identifier
, as WS
is on a separate channel.