Search code examples
parsingantlrantlrworks

Why this error happened? - "The following alternatives can never be matched"


I'm interested in making compiler, parser, and parser generator, but I don't know much about them.

After reading an answer of this question, I tried to make a 'very' simple LaTeX parser.

This is the code:

grammar Latex;

latex   :   ITEM*;
ITEM    :   CMD|LAWTEXT;
CMD :   CHEAD ARGS;
CHEAD   :   '\\' LETTER(LETTER|DIGIT)*;
LETTER  :   'A'..'Z'|'a'..'z';
DIGIT   :   '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS    :   '{' ITEM* '}';
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
    :   ' '|'\t'|'\n'|'\r';
PUNC    :   '!'|'^';

(There's only two char in PUNC, for test purpose)

And this is the error message:

[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2

[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

I found that this error happens because there's an ambiguity, a code can be interpreted in more than two ways, but I have no idea how this ambiguity is generated.

And this is the diagram and two ways something can be interpreted (maybe).

The diagram.

... but how \ and } can be confused?


Solution

  • JiminP wrote:

    I'm interested in making compiler, parser, and parser generator, but I don't know much about them.

    ANTLR creates a lexer and parser for you based on the grammar you write. ANTLR itself is the parser generator, so you don't write the parser generator itself (luckily!). A compiler is an application that takes the tree your parser generates and translates the input into some other form: this is something you need to do yourself. So, to emphasize: ANTLR only helps you create a parser for your language, the rest is up to you.

    Now, the problem(s).

    Your grammar contains almost only lexer rules. Lexer rules start with a capital letter and are used to tokenize your input source. So rules like these:

    LETTER  :   'A'..'Z'|'a'..'z';
    ...
    LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;
    

    might cause the lexer to create a LETTER token on its own. If you always want a lower- or uppercase ascii letter to become a LAWTEXT token, then you need to make LETTER a fragment rule like this:

    fragment LETTER  :   'A'..'Z'|'a'..'z';
    ...
    LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)+;
    

    And as you can see, I ended the LAWTEXT rule with a + instead of a *: you don't want to create tokens that contain nothing (empty string).

    Also, args, item and cmd are no suitable candidates for lexer rule: they should be parser rules instead.

    Here's a grammar that generates a lexer and parser without any errors:

    grammar Latex;
    
    latex
      :  item* EOF
      ;
    
    item 
      :  cmd
      |  LAWTEXT
      ;
    
    cmd
      :  CHEAD args
      ;
    
    args
      :  '{' item* '}'
      ;
    
    CHEAD 
      :  '\\' LETTER (LETTER | DIGIT)*
      ;  
    
    LAWTEXT
      :  (LETTER | DIGIT | WHITESPACE | PUNC)+
      ;
    
    fragment  
    WHITESPACE 
      :  ' ' | '\t' | '\n' | '\r'
      ;
    
    fragment  
    PUNC       
      : '!' | '^'
      ;
    
    fragment
    LETTER
      :  'A'..'Z' | 'a'..'z'
      ;
    
    fragment
    DIGIT
      :  '0'..'9'
      ;
    

    EDIT

    As I already mentioned: lexer rules start with a capital letter while parser rules start with a lower case letter. A lexer, which is sometimes called a tokenizer or scanner, is responsible for chopping up the input source. The input source starts of as being just a stream of characters. These characters are then grouped together by the lexer. So, given the following lexer rules:

    Identifier
      :  (Letter | '_') (Letter | '_' | Digit)*
      ;
    
    Assign
      :  '='
      ;
    
    Number
      :  Digit+ ('.' Digit+)?
      ;
    
    fragment Digit
      :  '0'..'9'
      ;
    
    fragment Letter
      :  'a'..'z' | 'A'..'Z'
      ;
    
    Spaces
      :  (' ' | '\t' | '\r' | '\n') {skip();}
      ;
    

    could take input source like:

    foo = 12.34
    

    which the lexer sees as:

    'f', 'o', 'o', ' ', '=', ' ', '1', '2', '.', '3', '4', EOF
    

    and will create the following tokens:

    • Identifier "foo"
    • Assign "="
    • Number "12.34"

    (note that there are no tokens being created from the white spaces: I skipped these!)

    After the lexer has created tokens from your input source, the parser is passed these tokens. An assignment parser rule could then look like:

    assignment
      :  Identifier Assign Number
      ;
    

    It is important to keep in mind that the input source is first tokenized by the lexer and only after that process, the parser rules come into play.