Search code examples
antlrantlr4parenthesesbnfebnf

Understanding ANTLR/EBNF parentheses


The ANTLR Reference Guide says about parentheses as follows:

enter image description here

On the other hand, I have the following grammar:

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT : [0-9] ;

WORD                : (LOWERCASE | UPPERCASE | DIGIT )+;

If I run the string "asd90", it will be matched as a WORD as below:

>java org.antlr.v4.gui.TestRig Myg myg -tokens
asd90
[@0,0:4='asd90',<WORD>,1:0]
[@1,5:6='\r\n',<NEWLINE>,1:5]
[@2,7:6='<EOF>',<EOF>,2:0]

This causes me to be confused as I would expect "asd90" to not be matched since in my grammar, LOWERCASE and DIGIT are 2 different subrules, so i would have expected it to be not matched as "WORD".

In other terms it's like saying that "intvoid" would work as a return type where clearly it would not.


Solution

  • You seem to miss the loop operator in your consideration. The lexer rule WORD can match a sequence of any LOWERCASE, UPPERCASE and DIGIT, which means it is totally valid to match asd90 by using 5 runs of that loop (matching the lower case rule 3 times and the digits rule twice).

    The example from the book has no loop, so it can only match once, either a type or void.

    A few more details: fragment rules are kinda private rules (and in fact in older ANLTR versions there was the keyword private, which later was renamed to fragment). However, they are not available in the parser and do not appear in the list of lexer rules. Your WORD rule is stored in the ATN like so:

    enter image description here

    which shows that the 3 fragment rules are called like any other rule.

    But there's another difference (and that's what Pavel mentions). Fragment lexer rules are handled a bit differently when it comes to ambiquity resolution.

    Normally the resolution is like this: The rule which matches the longest input wins. If two rules match the same input then the one which comes first in the grammar wins. This holds true for both parser and lexer rules. However, not for fragment rules. Even if a fragment rule that matches the same input as another (non-fragment) rule appears first in the grammar, the non-fragment rule still wins.

    You can prove that by changing your grammar a bit:

    grammar Example;
    
    start: WORD;
    
    WS : [ \t\r\n]+ -> skip;
    
    fragment LOWERCASE  : [a-z]+ ;
    fragment UPPERCASE  : [A-Z]+ ;
    fragment DIGIT : [0-9]+ ;
    WORD: (LOWERCASE | UPPERCASE | DIGIT )+;
    

    This gives you these tokens:

    enter image description here

    With the input abc you will still get WORD as matched token, even though LOWERCASE comes first in the grammar. Now remove the fragment keyword and you will see that LOWERCASE is matched:

    Parser error (1, 1): mismatched input 'abc' expecting WORD
    
    Tokens:
    [@0,0:2='abc',<2>,1:0]
    [@1,4:3='<EOF>',<-1>,2:0]
    
    Parse Tree:
    start (
      <Error>"abc"
    )
    

    which leads to a syntax error, because a WORD was expected. The token list becomes then:

    enter image description here

    However, even if you remove all fragment keywords, you will still get WORD for input like abc90, because that rules matches more than any of the invidiual lexer rules before it.