Search code examples
parsingantlrgrammarabstract-syntax-treecocor

Expressions in a CoCo to ANTLR translator


I'm parsing CoCo/R grammars in a utility to automate CoCo -> ANTLR translation. The core ANTLR grammar is:

rule '=' expression '.' ;

expression
     : term ('|' term)*
         -> ^( OR_EXPR term term* )
     ;
term
     : (factor (factor)*)? ;

factor
     : symbol
     | '(' expression ')'
         -> ^( GROUPED_EXPR expression )
     | '[' expression']'
         -> ^( OPTIONAL_EXPR expression)
     | '{' expression '}'
         -> ^( SEQUENCE_EXPR expression)
     ;

symbol
     : IF_ACTION
     | ID (ATTRIBUTES)?
     | STRINGLITERAL
     ;

My problem is with constructions such as these:

CS = { ExternAliasDirective }
         { UsingDirective }
         EOF .

CS results in an AST with a OR_EXPR node although no '|' character actually appears. I'm sure this is due to the definition of expression but I cannot see any other way to write the rules.

I did experiment with this to resolve the ambiguity.

// explicitly test for the presence of an '|' character
expression
@init { bool ored = false; }
     : term {ored = (input.LT(1).Type == OR); } (OR term)*
         ->  {ored}? ^(OR_EXPR term term*)
         ->            ^(LIST term term*)

It works but the hack reinforces my conviction that something fundamental is wrong.

Any tips much appreciated.


Solution

  • Your rule:

    expression
      : term ('|' term)*
          -> ^( OR_EXPR term term* )
      ;
    

    always causes the rewrite rule to create a tree with a root of type OR_EXPR. You can create "sub rewrite rules" like this:

    expression
      :  (term -> REWRITE_RULE_X) ('|' term -> ^(REWRITE_RULE_Y))*
      ;
    

    And to resolve the ambiguity in your grammar, it's easiest to enable global backtracking which can be done in the options { ... } section of your grammar.

    A quick demo:

    grammar CocoR;
    
    options {
      output=AST;
      backtrack=true;
    }
    
    tokens {
      RULE;
      GROUP;
      SEQUENCE;
      OPTIONAL;
      OR;
      ATOMS;
    }
    
    parse
      :  rule EOF -> rule
      ;
    
    rule
      :  ID '=' expr* '.' -> ^(RULE ID expr*)
      ;
    
    expr
      :  (a=atoms -> $a) ('|' b=atoms -> ^(OR $expr $b))*
      ;
    
    atoms
      :  atom+ -> ^(ATOMS atom+)
      ;
    
    atom
      :  ID
      |  '(' expr ')' -> ^(GROUP expr)
      |  '{' expr '}' -> ^(SEQUENCE expr)
      |  '[' expr ']' -> ^(OPTIONAL expr)
      ;
    
    ID
      :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
      ;
    
    Space
      :  (' ' | '\t' | '\r' | '\n') {skip();}
      ;
    

    with input:

    CS = { ExternAliasDirective }
         { UsingDirective }
         EOF .
    

    produces the AST:

    enter image description here

    and the input:

    foo = a | b ({c} | d [e f]) .
    

    produces:

    enter image description here

    The class to test this:

    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            /*
            String source = 
                    "CS = { ExternAliasDirective } \n" +
                    "{ UsingDirective }            \n" + 
                    "EOF .                           ";
            */
            String source = "foo = a | b ({c} | d [e f]) .";
            ANTLRStringStream in = new ANTLRStringStream(source);
            CocoRLexer lexer = new CocoRLexer(in);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            CocoRParser parser = new CocoRParser(tokens);
            CocoRParser.parse_return returnValue = parser.parse();
            CommonTree tree = (CommonTree)returnValue.getTree();
            DOTTreeGenerator gen = new DOTTreeGenerator();
            StringTemplate st = gen.toDOT(tree);
            System.out.println(st);
        }
    }
    

    and with the output this class produces, I used the following website to create the AST-images: http://graph.gafol.net/

    HTH


    EDIT

    To account for epsilon (empty string) in your OR expressions, you might try something (quickly tested!) like this:

    expr
      :  (a=atoms -> $a) ( ( '|' b=atoms -> ^(OR $expr $b)
                           | '|'         -> ^(OR $expr NOTHING)
                           )
                         )*
      ;
    

    which parses the source:

    foo = a | b | .
    

    into the following AST:

    enter image description here