Why this error happened? - "The following alternatives can never be matched"

I'm interested in making compiler, parser, and parser generator, but I don't know much about them.

After reading an answer of this question, I tried to make a 'very' simple LaTeX parser.

This is the code:

grammar Latex;

latex   :   ITEM*;
ITEM    :   CMD|LAWTEXT;
CMD :   CHEAD ARGS;
CHEAD   :   '\\' LETTER(LETTER|DIGIT)*;
LETTER  :   'A'..'Z'|'a'..'z';
DIGIT   :   '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS    :   '{' ITEM* '}';
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
    :   ' '|'\t'|'\n'|'\r';
PUNC    :   '!'|'^';

(There's only two char in PUNC, for test purpose)

And this is the error message:

[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2

[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

I found that this error happens because there's an ambiguity, a code can be interpreted in more than two ways, but I have no idea how this ambiguity is generated.

And this is the diagram and two ways something can be interpreted (maybe).

The diagram.

... but how \ and } can be confused?

Solution

JiminP wrote:

I'm interested in making compiler, parser, and parser generator, but I don't know much about them.

ANTLR creates a lexer and parser for you based on the grammar you write. ANTLR itself is the parser generator, so you don't write the parser generator itself (luckily!). A compiler is an application that takes the tree your parser generates and translates the input into some other form: this is something you need to do yourself. So, to emphasize: ANTLR only helps you create a parser for your language, the rest is up to you.

Now, the problem(s).

Your grammar contains almost only lexer rules. Lexer rules start with a capital letter and are used to tokenize your input source. So rules like these:

LETTER  :   'A'..'Z'|'a'..'z';
...
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;

might cause the lexer to create a LETTER token on its own. If you always want a lower- or uppercase ascii letter to become a LAWTEXT token, then you need to make LETTER a fragment rule like this:

fragment LETTER  :   'A'..'Z'|'a'..'z';
...
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)+;

And as you can see, I ended the LAWTEXT rule with a + instead of a *: you don't want to create tokens that contain nothing (empty string).

Also, args, item and cmd are no suitable candidates for lexer rule: they should be parser rules instead.

Here's a grammar that generates a lexer and parser without any errors:

grammar Latex;

latex
  :  item* EOF
  ;

item 
  :  cmd
  |  LAWTEXT
  ;

cmd
  :  CHEAD args
  ;

args
  :  '{' item* '}'
  ;

CHEAD 
  :  '\\' LETTER (LETTER | DIGIT)*
  ;  

LAWTEXT
  :  (LETTER | DIGIT | WHITESPACE | PUNC)+
  ;

fragment  
WHITESPACE 
  :  ' ' | '\t' | '\n' | '\r'
  ;

fragment  
PUNC       
  : '!' | '^'
  ;

fragment
LETTER
  :  'A'..'Z' | 'a'..'z'
  ;

fragment
DIGIT
  :  '0'..'9'
  ;

EDIT

As I already mentioned: lexer rules start with a capital letter while parser rules start with a lower case letter. A lexer, which is sometimes called a tokenizer or scanner, is responsible for chopping up the input source. The input source starts of as being just a stream of characters. These characters are then grouped together by the lexer. So, given the following lexer rules:

Identifier
  :  (Letter | '_') (Letter | '_' | Digit)*
  ;

Assign
  :  '='
  ;

Number
  :  Digit+ ('.' Digit+)?
  ;

fragment Digit
  :  '0'..'9'
  ;

fragment Letter
  :  'a'..'z' | 'A'..'Z'
  ;

Spaces
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

could take input source like:

foo = 12.34

which the lexer sees as:

'f', 'o', 'o', ' ', '=', ' ', '1', '2', '.', '3', '4', EOF

and will create the following tokens:

Identifier "foo"
Assign "="
Number "12.34"

(note that there are no tokens being created from the white spaces: I skipped these!)

After the lexer has created tokens from your input source, the parser is passed these tokens. An assignment parser rule could then look like:

assignment
  :  Identifier Assign Number
  ;

It is important to keep in mind that the input source is first tokenized by the lexer and only after that process, the parser rules come into play.