Search code examples
javaantlrdslantlr3antlrworks

ANTLR - Token Enumeration Mismatch Between Grammar and Tree Grammar


BackGround

I am trying to write a simple grammar, using AntlrWorks, for boolean equations that test sets of values for the existence (or lack there of) of specified elements. I have created a combined lexer/parser grammar that produces the desired ASTs. I have also written a cooresponding tree grammar, that seems to work (passes AntlrWorks' debug functions).


Problem

However, When I try to wire the two together in a test program (that is lex, parse, and tree parse in the same program), I get errors like...

node from line 1:5 required (...)+ loop did not match anything at input 'and'

and

node from after line 1:8 mismatched tree node: UP expecting <DOWN>

As a sanity test, I had the test program output the results of toStringTree() from the generated AST and toTokenTypeString() from the resulting TreeNodeStream.

What I found was that the enumerated token type values of the TreeNodeStream do not match the enumerated token type values in the autogenerated Tree Grammar code.


EXAMPLE

  • sample input: "true and false"

  • Output of toStringTree() from the Parser supplied tree: (and true false)

  • Output of toTokenTypeString() from TreeNodeStream conatining the above AST: 19 2 22 20 3 8

That token stream should be AND <DOWN> 'true' 'false' <UP> NEWLINE But the TreeParser sees it as CLOSEPAREN <DOWN> OR 'false' <UP> OPENPAREN (based off of looking at the node token type output and checking it against the enumeration defined in the tree grammar) and throws the error

1:5 required (...)+ loop did not match anything at input 'and'


Bottom Line

Why isn't my tree parser set up to properly identify my ASTs?

Below is my source. I appreciate any feedback on the foolish mistakes I must have made :)

Lexer/Parser Grammar

grammar INTc;

options {
   output=AST;
   ASTLabelType=CommonTree;
}

tokens {
   OR='or';
   AND='and';
   NOT='not';
   ALLIN='+';
   PARTIN='^';
   NOTIN='!';
   SET;
   OPENPAREN='(';
   CLOSEPAREN=')';
   OPENSET='{';
   CLOSESET='}';
}
@header {
package INTc;
}

@lexer::header {
package INTc;
}

@members {
}

/**Begin Parser Rules*/
prog:   stat+ ;

stat:   expr
    |   NEWLINE
    ;

expr
:  orExpr
;

orExpr returns [boolean value]
    :   a=andExpr(OR^ b=andExpr)*
    ;

andExpr returns [boolean value]
    :   a=notExpr (AND^ b=notExpr)*
    ; 

notExpr returns [boolean value]
    :   a=atom
    | '!' a=atom -> ^(NOT atom)
    ;

atom returns [boolean value]
    :   ALLIN  OPENSET ((INT)(','INT)*) CLOSESET   -> ^(ALLIN ^(SET INT+))
    |   PARTIN  OPENSET ((INT)(','INT)*) CLOSESET  -> ^(PARTIN ^(SET INT+))
    |   NOTIN OPENSET ((INT)(','INT)*) CLOSESET   -> ^(NOTIN  ^(SET INT+))
    |   TIMERANGE
    |   OPENPAREN! e=expr CLOSEPAREN!
    |   'true'
    |   'false'
    ;

/**Begin Lexer Rules*/
ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;
DIGIT   :   ('0'..'9');
INT :   DIGIT+ ;
NEWLINE :   '\r'? '\n' ;
WS  :   ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
COMMENT
    :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

Tree Grammar

tree grammar INTcWalker;

options {
  tokenVocab=INTc;
  ASTLabelType=CommonTree;
}

@header {
  package INTc;
  import java.util.ArrayList;
  import java.util.Arrays;
}

@members {
  ArrayList<String> intSet;
  boolean isFit = false;

  public boolean getResult() {
     return isFit;   
  }
  public void setINTSet(ArrayList newSet) {
     intSet = newSet;
     isFit = false;
  }
  public ArrayList getINTSET(){return intSet;}
}

prog
:     stat+
;
stat
:     expr  {
                                     isFit = $expr.value;
                                     //System.out.println(isFit);
    }
|    NEWLINE {}
;
expr returns [boolean value]
: ^(OR a=expr b=expr){}
| ^(AND a=expr b=expr){}
| ^(NOT a=expr){}
| ^(ALLIN ^(SET INT+)){}
| ^(PARTIN ^(SET INT+)){}
| ^(NOTIN ^(SET INT+)){}
| 'true'        {$value = true;}
| 'false'       {$value = false;}
;

Test Program

public class setTest {

    public static void main(String args[]) throws Exception {
        INTcLexer lex = new INTcLexer(new ANTLRFileStream("input.txt"));
        CommonTokenStream tokens = new CommonTokenStream(lex);

        INTcParser parser = new INTcParser(tokens);
        INTcParser.prog_return r = parser.prog();
        CommonTree t  = (CommonTree)r.getTree();
        CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
        INTcWalker evaluator = new INTcWalker(nodes);

        System.out.println(t.toStringTree());

        System.out.println(nodes.toTokenTypeString());
        nodes.reset();

        try {
                evaluator.prog();
        } catch (RecognitionException e) {
                e.printStackTrace();
        }   

        System.out.println(evaluator.getResult());

    }   
}

Solution

  • If I use your combined grammar and tree grammar to create lexer, parser and tree-walker classes, and run the following class:

    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    
    public class Main {
      public static void main(String args[]) throws Exception {
        INTcLexer lex = new INTcLexer(new ANTLRStringStream("true and false\n"));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        INTcParser parser = new INTcParser(tokens);
    
        CommonTree t  = (CommonTree)parser.prog().getTree();
        CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
        INTcWalker evaluator = new INTcWalker(nodes);
    
        System.out.println(t.toStringTree());
    
        CommonTree tr;
        while(true) {
          Token token = ((CommonTree)nodes.nextElement()).getToken();
          if(token.getType() == INTcParser.EOF) break;
          System.out.printf("%-10s '%s'\n", INTcParser.tokenNames[token.getType()], token.getText());
        }
    
        System.out.println("\nresult=" + evaluator.getResult());
      }
    }
    

    the following is printed to the console:

    (and true false) 
    
    AND        'and'
    <DOWN>     'DOWN'
    'true'     'true'
    'false'    'false'
    <UP>       'UP'
    NEWLINE    '
    '
    
    result=false
    

    I.e.: I see the expected output:

    • the tree is okay ((and true false));
    • CommonTreeNodeStream contains the proper tokens (or better: trees);
    • and the correct value, false, is being printed without any errors from either the parser or tree walker.

    A couple of tips:

    • create tokens for both 'true' and 'false' (i.e. TRUE='true'; ...);
    • don't use literals inside your tree grammar (not 'true', but TRUE);
    • make DIGIT a fragment rule, that way it will never become a token of its own, but only used inside INT (or other lexer rules). Simply place the keyword fragment in front of it;
    • both .* and .+ are ungreedy by default, so you can remove the options greedy=false;} :.