Understanding ANTLR/EBNF parentheses

The ANTLR Reference Guide says about parentheses as follows:

On the other hand, I have the following grammar:

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment DIGIT : [0-9] ;

WORD                : (LOWERCASE | UPPERCASE | DIGIT )+;

If I run the string "asd90", it will be matched as a WORD as below:

>java org.antlr.v4.gui.TestRig Myg myg -tokens
asd90
[@0,0:4='asd90',<WORD>,1:0]
[@1,5:6='\r\n',<NEWLINE>,1:5]
[@2,7:6='<EOF>',<EOF>,2:0]

This causes me to be confused as I would expect "asd90" to not be matched since in my grammar, LOWERCASE and DIGIT are 2 different subrules, so i would have expected it to be not matched as "WORD".

In other terms it's like saying that "intvoid" would work as a return type where clearly it would not.

Solution

You seem to miss the loop operator in your consideration. The lexer rule WORD can match a sequence of any LOWERCASE, UPPERCASE and DIGIT, which means it is totally valid to match asd90 by using 5 runs of that loop (matching the lower case rule 3 times and the digits rule twice).

The example from the book has no loop, so it can only match once, either a type or void.

A few more details: fragment rules are kinda private rules (and in fact in older ANLTR versions there was the keyword private, which later was renamed to fragment). However, they are not available in the parser and do not appear in the list of lexer rules. Your WORD rule is stored in the ATN like so:

which shows that the 3 fragment rules are called like any other rule.

But there's another difference (and that's what Pavel mentions). Fragment lexer rules are handled a bit differently when it comes to ambiquity resolution.

Normally the resolution is like this: The rule which matches the longest input wins. If two rules match the same input then the one which comes first in the grammar wins. This holds true for both parser and lexer rules. However, not for fragment rules. Even if a fragment rule that matches the same input as another (non-fragment) rule appears first in the grammar, the non-fragment rule still wins.

You can prove that by changing your grammar a bit:

grammar Example;

start: WORD;

WS : [ \t\r\n]+ -> skip;

fragment LOWERCASE  : [a-z]+ ;
fragment UPPERCASE  : [A-Z]+ ;
fragment DIGIT : [0-9]+ ;
WORD: (LOWERCASE | UPPERCASE | DIGIT )+;

This gives you these tokens:

With the input abc you will still get WORD as matched token, even though LOWERCASE comes first in the grammar. Now remove the fragment keyword and you will see that LOWERCASE is matched:

Parser error (1, 1): mismatched input 'abc' expecting WORD

Tokens:
[@0,0:2='abc',<2>,1:0]
[@1,4:3='<EOF>',<-1>,2:0]

Parse Tree:
start (
  <Error>"abc"
)

which leads to a syntax error, because a WORD was expected. The token list becomes then:

However, even if you remove all fragment keywords, you will still get WORD for input like abc90, because that rules matches more than any of the invidiual lexer rules before it.