
ANTLR: simple example from ANTLRWorks wizard doesn't work


Grammar:

grammar test;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

start 
    :   STRING EOF;

This grammar was generated by the ANTLRWorks wizard; I only added the 'start' rule.

Input in interpreter:

"abc"

Result in console:

[19:09:54] Interpreting...
[19:09:54] problem matching token at 1:2 MismatchedTokenException(97!=34)
[19:09:54] problem matching token at 1:3 NoViableAltException('b'@[1:1: Tokens : ( WS | STRING );])
[19:09:54] problem matching token at 1:4 NoViableAltException('c'@[1:1: Tokens : ( WS | STRING );])
[19:09:54] problem matching token at 1:5 NoViableAltException(''@[()* loopback of 11:12: ( ESC_SEQ | ~ ( '\\' | '"' ) )*])
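
A reading aid for the first message: the two numbers in MismatchedTokenException(97!=34) appear to be character codes, the character the lexer found versus the one it expected. The sketch below only decodes them; the CharCodes class and its describe method are hypothetical names made up for this illustration and are not part of ANTLR.

```java
public class CharCodes {
    // Render a numeric character code next to the character it stands for.
    static String describe(int code) {
        return code + " = '" + (char) code + "'";
    }

    public static void main(String[] args) {
        System.out.println(describe(97)); // prints: 97 = 'a'
        System.out.println(describe(34)); // prints: 34 = '"'
    }
}
```

If that reading is right, the lexer met 'a' at a point where it expected a double quote.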

Screenshot: http://habreffect.ru/files/200/4cac2487f/antlr.png

I'm using ANTLRWorks v1.4. I also tried from the console with ANTLR v3.2 and got the same result.

If I type "\nabc" instead of "abc", it works fine. If I move ESC_SEQ to the right-hand side of the alternation in the STRING rule, then "abc" works, but "\nabc" fails.


Solution

  • This appears to be a bug in ANTLRWorks 1.4. You could try ANTLRWorks 1.3 (or earlier) to see whether that version behaves correctly; I only did a quick check with v1.4.

    From the console, both of your example strings ("abc" and "\nabc") are parsed without any problems. Here's my test rig and the corresponding output:

    grammar test;
    
    start 
      :  STRING {System.out.println("parsed :: "+$STRING.text);} EOF
      ;
    
    WS  
      :  (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
      ;
    
    STRING
      :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
      ;
    
    fragment
    HEX_DIGIT 
      :  ('0'..'9'|'a'..'f'|'A'..'F') 
      ;
    
    fragment
    ESC_SEQ
      :  '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
      |  UNICODE_ESC
      |  OCTAL_ESC
      ;
    
    fragment
    OCTAL_ESC
      :  '\\' ('0'..'3') ('0'..'7') ('0'..'7')
      |  '\\' ('0'..'7') ('0'..'7')
      |  '\\' ('0'..'7')
      ;
    
    fragment
    UNICODE_ESC
      :  '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
      ;
    

    Note that the grammar is the same as yours, only formatted a bit differently.

    And the "main" class:

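    import org.antlr.runtime.*;
    
    public class Demo {
        public static void main(String[] args) throws Exception {
            // The text to parse is passed as the first command-line argument.
            ANTLRStringStream in = new ANTLRStringStream(args[0]);
            // testLexer and testParser are the classes ANTLR generates from test.g.
            testLexer lexer = new testLexer(in);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            testParser parser = new testParser(tokens);
            // Invoke the entry rule; its embedded action prints the matched STRING.
            parser.start();
        }
    }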
    

    Now, from the console, generate the lexer and parser:

    java -cp antlr-3.2.jar org.antlr.Tool test.g
    

    Compile all .java source files:

    javac -cp antlr-3.2.jar *.java
    

    and run the "main" class:

    java -cp .:antlr-3.2.jar Demo \"\\nabc\"
    // output:                                   parsed :: "\nabc"
    
    java -cp .:antlr-3.2.jar Demo \"abc\"
    // output:                                   parsed :: "abc"
    

    (On Windows, replace the : with a ; in the commands above.)

    Note that the command-line arguments above are written for Bash, where " and \ need to be escaped; this may be different on your system. As you can see from the output, both "\nabc" and "abc" are parsed properly.
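
    To sidestep shell escaping while experimenting, the two inputs can also be written as Java string literals. The sketch below does that and checks them against a regular expression that approximates the STRING rule. This is an assumption for illustration only: ANTLR does not use java.util.regex, and the StringRuleCheck class and isString method are hypothetical names, not generated code.

    ```java
    import java.util.regex.Pattern;

    public class StringRuleCheck {
        // Regex approximation of the STRING lexer rule, one alternative per line.
        private static final Pattern STRING = Pattern.compile(
            "\""
            + "(?:"
            +   "\\\\[btnfr\"'\\\\]"      // simple escapes from ESC_SEQ
            +   "|\\\\u[0-9a-fA-F]{4}"    // UNICODE_ESC
            +   "|\\\\[0-3][0-7][0-7]"    // OCTAL_ESC, three digits
            +   "|\\\\[0-7][0-7]"         // OCTAL_ESC, two digits
            +   "|\\\\[0-7]"              // OCTAL_ESC, one digit
            +   "|[^\\\\\"]"              // any char except backslash and quote
            + ")*"
            + "\"");

        // True when the whole input is a single STRING under the approximation.
        static boolean isString(String input) {
            return STRING.matcher(input).matches();
        }

        public static void main(String[] args) {
            String plain = "\"abc\"";      // what Bash passes for \"abc\"
            String escaped = "\"\\nabc\""; // what Bash passes for \"\\nabc\"
            System.out.println(plain + " -> " + isString(plain));     // "abc" -> true
            System.out.println(escaped + " -> " + isString(escaped)); // "\nabc" -> true
        }
    }
    ```

    Both inputs match, which agrees with the console run above: the grammar itself accepts them.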

    ANTLRWorks is a great tool for editing grammar files, but (in my experience) it has quite a few quirky bugs like this one. That's why I only edit grammars with it, and generate, compile and test the files from the console as shown above.

    HTH