
ANTLR4 matches to lexer rule instead of parser rule


Here is my short ANTLR4 language:

grammar test;

prog: (decl | expr)+
;
decl: doc | quiz
;

doc: '%doc' paramlist
;

quiz: '%quiz' paramlist STR? '%quiz' ENDL
;
paramlist: '(' VAR '=' PARAMVAL {, VAR '=' PARAMVAL}')'
;
expr:expr '*' expr
|expr '+' expr
|expr '-' expr
|DOC
;

// tokens
DOC: 'doc';
PERCENT: '%';
VAR:  [a-zA-Z_][a-zA-Z0-9_]* ;
PARAMVAL: [^,]+|'"'[^"]*'"' ;
STR: (~["\\\\r\\n] | EscapeSequence)+ ;
fragment EscapeSequence:
'\\' 'u005c'? [btnfr"'\\]
| '\\' 'u005c'? ([0-3]? [0-7])? [0-7]
| '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit;
fragment HexDigit: [0-9a-fA-F];
ENDL: '\n' ;
WS: [ \t\n]+ -> skip;

In order to use the doc parser rule, I write '%doc', which ANTLR recognizes according to this screenshot.

%doc

However, when I try to fill in the missing PARAMVAL, the parse tree instead recognizes everything as STR.

%doc(
%doc(v=^)

Same case with quiz.
%quiz
%quiz(

It works when you add a delimiter around the STR rule. I would like to use the STR rule without a delimiter, however.

Why is the STR rule being recognized when none of the parser rules involved use STR? (The quiz rule does use it, but only in the middle of the rule.)


Solution

  • As mentioned by 500 - Internal Server Error in the comments: the lexer works independently from the parser. The lexer follows 2 rules:

    1. try to consume as many characters as possible for a lexer rule
    2. when 2 (or more) lexer rules match the same characters, let the rule defined first "win"

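    Because of the first rule, the input "%doc(v=^)" becomes a single STR token: STR accepts every character in that input, so it matches all 9 characters, while the literal '%doc' matches only the first 4.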
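The two lexer rules above can be imitated in a few lines of Python. This is only a rough sketch, not ANTLR's actual implementation: the regular expressions are simplified stand-ins for the grammar's lexer rules (STR is approximated without escape sequences; PARAMVAL, ENDL and WS are left out), T__0 stands for the token ANTLR generates for the literal '%doc', and the names LPAREN, EQ and RPAREN are hypothetical labels for the other literals.

```python
import re

# Simplified stand-ins for the lexer rules, listed in definition order.
# Tokens generated from literals in parser rules come before the named rules.
RULES = [
    ("T__0", re.compile(r"%doc")),        # token generated for the literal '%doc'
    ("LPAREN", re.compile(r"\(")),        # hypothetical name for '('
    ("EQ", re.compile(r"=")),             # hypothetical name for '='
    ("RPAREN", re.compile(r"\)")),        # hypothetical name for ')'
    ("DOC", re.compile(r"doc")),
    ("PERCENT", re.compile(r"%")),
    ("VAR", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
    ("STR", re.compile(r'[^"\\\r\n]+')),  # approximation: escape sequences omitted
]


def next_token(text, pos):
    """Pick the rule with the longest match at pos; ties go to the earliest rule."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the rule defined first is kept.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best


def tokenize(text):
    """Split text into (rule name, matched text) pairs."""
    pos, tokens = 0, []
    while pos < len(text):
        found = next_token(text, pos)
        if found is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = found
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


print(tokenize("%doc"))       # [('T__0', '%doc')]
print(tokenize("%doc(v=^)"))  # [('STR', '%doc(v=^)')]
```

For "%doc" both T__0 and STR match 4 characters, so the earlier rule wins; as soon as the input gets longer, STR has the longest match and takes everything.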

    Some other things that are incorrect, or are working differently than you might think: when defining literal tokens inside parser rules, ANTLR creates lexer rules automatically. This means that if you do:

    doc
     : '%doc' paramlist
     ;
    
    DOC     : 'doc';
    PERCENT : '%';
    

    ANTLR will create this behind the scenes:

    doc
     : T__0 paramlist
     ;
    
    T__0    : '%doc';
    DOC     : 'doc';
    PERCENT : '%';
    

    and because of rule 1, the input "%doc" will always become a T__0 token, and never PERCENT and DOC tokens.

    Also, [^,] does not match any character other than a comma: it matches either a ^ or a ,. You probably meant ~[,]. But be careful: doing ~[,]+ will again (like STR) match far too many characters.
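To make the difference concrete, here is a small Python sketch that translates both ANTLR character sets into Python's `re` syntax, where the notation happens to be reversed: ANTLR's `[^,]` corresponds to `[\^,]` in Python, and ANTLR's `~[,]` corresponds to Python's `[^,]`. The variable names are made up for this illustration.

```python
import re

# ANTLR `[^,]+`: one or more characters that are either '^' or ','.
antlr_caret_or_comma = re.compile(r"[\^,]+")

# ANTLR `~[,]+`: one or more characters that are anything except ','.
antlr_not_comma = re.compile(r"[^,]+")

text = "%doc(v=^)"

# '%' is neither '^' nor ',', so the first set cannot even start matching here.
print(antlr_caret_or_comma.match(text))     # None

# The negated set swallows the whole input, because it contains no comma.
print(antlr_not_comma.match(text).group())  # %doc(v=^)
```

The second result shows why `~[,]+` is just as greedy as STR: without a comma in the input, nothing stops it.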