java antlr antlrworks

Cannot interpret ANTLRWorks output


I am using the following simple grammar to get an understanding of ANTLR.

grammar Example;
options {
language=Java;
}

ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

INT : '0'..'9'+
    ;
PLUS    :   '+';


ADDNUM  :   
    INT PLUS INT;

prog    :    ADDNUM;

When I try running the grammar in ANTLRWorks for the input 1+2, I get the following error in the console:

[16:54:08] Interpreting... [16:54:08] problem matching token at 2:0
NoViableAltException(' '@[1:1: Tokens : ( ID | INT | PLUS | ADDNUM);])

Can anyone please help me understand where I am going wrong?


Solution

  • You probably didn't indicate prog as the starting rule in ANTLRWorks. If you do, it all goes okay.

    But you really shouldn't create a lexer rule that matches a whole expression, as you do in ADDNUM. That should be a parser rule:

    grammar Example;
    
    prog    : addExpr EOF;
    addExpr : INT PLUS INT;
    ID      : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
    INT     : '0'..'9'+;
    PLUS    : '+';
    

    ANTLR rules

    There are no strict rules about when to use parser, lexer or fragment rules, but here's what they're usually used for:

    lexer rules

    A lexer rule usually matches the smallest part of a language (a string, a number, an identifier, a comment, etc.). Trying to create a lexer rule from input like 1+2 causes problems because:

    • if you ever want to extract something meaningful from that token (evaluate it, for example), you need to split the contents of that token yourself, because once a single token is created from it, the text of the entire expression is "glued" together;
    • you run into problems when there is whitespace between the parts: 1 +   2.

    The expression 1+2 consists of three tokens: INT, PLUS and another INT.
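
    To make the three-token view concrete, here is a minimal hand-written sketch in Java. It is not what ANTLR generates: the class name TokenSketch and the "TYPE:text" string format are made up for illustration, and skipping whitespace is an assumption, since the grammar above has no whitespace rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; ANTLR generates its own lexer from the grammar.
class TokenSketch {
    // Splits the input into "TYPE:text" strings; the type names mirror INT and PLUS.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // assumption: whitespace between tokens is simply skipped
            } else if (c == '+') {
                tokens.add("PLUS:+");
                i++;
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("INT:" + input.substring(start, i));
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1+2"));     // prints [INT:1, PLUS:+, INT:2]
        System.out.println(tokenize("1 +   2")); // prints [INT:1, PLUS:+, INT:2]
    }
}
```

    Because each operand is its own token, the spacing between them no longer matters and each number's text is available separately.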

    fragment rules

    A fragment rule is used when you don't want the rule to ever become a "real" token. For example, take the following lexer rules:

    ID    : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
    FLOAT : '0'..'9'+ '.' '0'..'9'+; 
    INT   : '0'..'9'+;
    

    In the rules above, you're using '0'..'9' four times, so you could place that in a separate rule:

    ID    : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | DIGIT)*;
    FLOAT : DIGIT+ '.' DIGIT+; 
    INT   : DIGIT+;
    DIGIT : '0'..'9';
    

    But you don't want to ever create a DIGIT token: you only want the DIGIT to be used by other lexer rules. In that case, you can create a fragment rule:

    ID    : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | DIGIT)*;
    FLOAT : DIGIT+ '.' DIGIT+; 
    INT   : DIGIT+;
    fragment DIGIT : '0'..'9';
    

    This makes sure there will never be a DIGIT token, and you can therefore never use DIGIT in your parser rule(s)!
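
    As a rough analogy in hand-written Java (again, not ANTLR output), a fragment behaves like a private helper method: other token-matching code calls it, but it never produces a token of its own. The names FragmentSketch, isDigit and skipDigits are hypothetical, the ID rule is left out for brevity, and skipping whitespace is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of the FLOAT, INT and fragment DIGIT rules.
class FragmentSketch {
    // Plays the role of "fragment DIGIT": used by other rules, never a token type itself.
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the index just past the run of digits that starts at 'from'.
    private static int skipDigits(String s, int from) {
        int i = from;
        while (i < s.length() && isDigit(s.charAt(i))) {
            i++;
        }
        return i;
    }

    // Emits only FLOAT and INT tokens as "TYPE:text"; there is no DIGIT token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // assumption: whitespace is skipped
                continue;
            }
            if (!isDigit(c)) {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
            int end = skipDigits(input, i);
            // FLOAT : DIGIT+ '.' DIGIT+
            if (end + 1 < input.length() && input.charAt(end) == '.' && isDigit(input.charAt(end + 1))) {
                end = skipDigits(input, end + 1);
                tokens.add("FLOAT:" + input.substring(i, end));
            } else {
                // INT : DIGIT+
                tokens.add("INT:" + input.substring(i, end));
            }
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("3.14 42 7")); // prints [FLOAT:3.14, INT:42, INT:7]
    }
}
```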

    parser rules

    Parser rules glue the tokens together: they make sure the input is syntactically valid (a.k.a. parsing). To emphasize, parser rules can use other parser rules or lexer rules, but not fragment rules.
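
    For intuition, the two parser rules prog and addExpr from the corrected grammar can be sketched as a tiny hand-written recursive-descent parser in Java. This is only an illustration under stated assumptions, not the parser ANTLR generates: the class ParserSketch, the match helper and the "TYPE:text" token strings are hypothetical, and returning the sum is an addition for demonstration (the grammar itself only checks syntax).

```java
import java.util.Arrays;
import java.util.List;

// Hand-written illustration of "prog : addExpr EOF;" and "addExpr : INT PLUS INT;".
class ParserSketch {
    private final List<String> tokens; // each entry is "TYPE:text", e.g. "INT:1"
    private int pos = 0;

    ParserSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // Consumes the next token if it has the expected type and returns its text.
    private String match(String type) {
        String actual = pos < tokens.size() ? tokens.get(pos) : "EOF:";
        if (!actual.startsWith(type + ":")) {
            throw new IllegalStateException("expected " + type + " but found " + actual);
        }
        pos++;
        return actual.substring(type.length() + 1);
    }

    // prog : addExpr EOF;
    int prog() {
        int value = addExpr();
        match("EOF");
        return value;
    }

    // addExpr : INT PLUS INT;
    private int addExpr() {
        int left = Integer.parseInt(match("INT"));
        match("PLUS");
        int right = Integer.parseInt(match("INT"));
        return left + right;
    }

    public static void main(String[] args) {
        ParserSketch parser = new ParserSketch(Arrays.asList("INT:1", "PLUS:+", "INT:2"));
        System.out.println(parser.prog()); // prints 3
    }
}
```

    Each parser rule becomes one method that consumes tokens in the order the rule lists them, which is why a token sequence such as INT PLUS with nothing after it is rejected.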


    Also see: ANTLR: Is there a simple example?