Antlr4 grammar for a function with variadic argument list


I'm trying to write a grammar for a DSL of mine using antlr4. In essence I'm trying to create a DSL for describing function applications in a tree structure.

Currently, I'm failing at creating the correct grammar (or using the visitor in C# correctly) for parsing expressions like

#func1(jsonconfig)
#func1(jsonconfig, #func2(...))
#func1(#func2(...), #func3(...), ..., #func_n(...))
#func1(jsonconfig, #func2(...), #func3(...), ..., #func_n(...))

My grammar (with some parts removed for brevity):

func
    : FUNCTION_START IDENTIFIER LPAREN (config?) (argumentList?) RPAREN
    ;

argument
   : func
   ;

argumentList
   : (ARG_SEPARATOR argument)+
   | ARG_SEPARATOR? argument
   ;

config
   : json
   ;

However, when I try to parse an expression, I get only the first argument and not the rest.

This is my visitor:

public class DslVisitor : JustDslBaseVisitor<Instruction>
{
    public override Instruction VisitFunc(JustDslParser.FuncContext context)
    {
        var name = context.IDENTIFIER().GetText();

        var conf = context.config()?.GetText();
        var arguments = context.argumentList()?.argument() ?? Array.Empty<JustDslParser.ArgumentContext>();

        var instruction = new Instruction
        {
            Name = name,
            Config = conf == null ? null : JObject.Parse(conf),
            Bindings = arguments.Select(x => x.Accept(this)).ToList()
        };

        return instruction;
    }

    public override Instruction VisitArgument(JustDslParser.ArgumentContext context)
    {
        return context.func().Accept(this);
    }
}

I think there is probably some syntax error in the antlr definition because it fails to parse a list, but successfully parses a single item. In the past I had a slightly different syntax, but it required me to always pass a config object which doesn't fit my needs.

Thanks!


Solution

  • Your code has a few problems.

    First, you never check whether the parse succeeded. Add an error listener and test whether the lexer or the parser reported any errors. The listener also lets you redirect the error output wherever you like.

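    using System.IO;
    using Antlr4.Runtime;
    
    // Records whether any syntax error was reported, then defers to the
    // console listener so the message is still written to the output.
    public class ErrorListener<S> : ConsoleErrorListener<S>
    {
        public bool had_error;
    
        public override void SyntaxError(TextWriter output, IRecognizer recognizer, S offendingSymbol, int line,
            int col, string msg, RecognitionException e)
        {
            had_error = true;
            base.SyntaxError(output, recognizer, offendingSymbol, line, col, msg, e);
        }
    }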
    

    Create one listener for the parser and one for the lexer, register each with AddErrorListener(), call the parse method, then check had_error on both listeners.

    Second, it took a lot of editing of the C# code to recover the input that people expect to see. I removed the C# escapes and reformatted it to get this input:

    #obj(
      #property(
        #unit(
          {"value":"phoneNumbers"}
        ),
        #agr_obj(
          #valueof(
            {"path":"$.phone_numbers"}
          ),
          #current(
            #valueof(
              {"path":"$.type"}
          ) ),
          #current(
            #valueof(
              {"path":"$.number"}
      ) ) ) ),
      #property(
        #unit(
          {"value":"addrs"}
        ),
        #agr_obj(
          #valueof(
            {"path":"$.addresses"}
          ),
          #current(
            #valueof(
              {"path":"$.type"}
          ) ),
          #current(
            #obj(
              #property(
                #unit(
                  {"value":"city"}
                ),
                #valueof(
                  {"path":"$.city"}
              ) ),
              #property(
                #unit(
                  {"value":"country"}
                ),
                #valueof(
                  {"path":"$.country"}
              ) ),
              #property(
                #unit(
                  {"value":"street"}
                ),
                #str_join(
                  {"separator":", "},
                  #valueof(
                    {"path":"$.street1"}
                  ),
                  #valueof(
                    {"path":"$.street2"}
    ) ) ) ) ) ) ) )
    

    Third, your grammar has no entry rule that ends with EOF. An EOF-terminated rule forces the parser to consume all of the input. Here, I just added a rule named "start":

    start : func EOF ;
    

    You will need to change your entry point to start() rather than func().
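
    Putting the listener and the new entry rule together, a driver might look roughly like the sketch below. It assumes the grammar is named JustDsl, so the ANTLR tool generates JustDslLexer and JustDslParser, and it reuses the ErrorListener class above together with the DslVisitor and Instruction types from the question. The input string and the Program class are made up for illustration.

```csharp
using System;
using Antlr4.Runtime;

public static class Program
{
    public static void Main()
    {
        // Hypothetical sample input: a config object followed by one nested function.
        var input = "#outer({\"key\":\"value\"}, #inner())";

        var lexer = new JustDslLexer(new AntlrInputStream(input));
        var lexerErrors = new ErrorListener<int>();       // lexer symbols are ints
        lexer.RemoveErrorListeners();                     // drop the default console listener
        lexer.AddErrorListener(lexerErrors);

        var parser = new JustDslParser(new CommonTokenStream(lexer));
        var parserErrors = new ErrorListener<IToken>();   // parser symbols are tokens
        parser.RemoveErrorListeners();
        parser.AddErrorListener(parserErrors);

        // Parse from the EOF-terminated rule so trailing input is not silently ignored.
        var tree = parser.start();

        if (lexerErrors.had_error || parserErrors.had_error)
        {
            Console.Error.WriteLine("Input did not parse cleanly.");
            return;
        }

        // Visit the func child rather than the start node: the default
        // VisitChildren returns the result of the last child, which here
        // is the EOF token and therefore null.
        var instruction = new DslVisitor().Visit(tree.func());
        Console.WriteLine(instruction.Name);
    }
}
```

    Removing the default listeners first avoids printing each message twice, since ErrorListener already forwards to the console listener it derives from.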

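    Finally, your argumentList rule does not cover every combination of a json config and func arguments. As written, a second argument is only matched when the list begins with a separator, so for #f(#a(...), #b(...)) only the first argument is matched. The first argument of a func can be either json or a func, and every later argument must be a func, so the first argument needs its own case. This grammar fixes that: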

    grammar JustDsl;
    
    LPAREN:             '(';
    RPAREN:             ')';
    FUNCTION_START:     '#';
    ARG_SEPARATOR:      ',';
    
    IDENTIFIER
        : [a-zA-Z] [a-zA-Z\-_] *
        ;
    
    start : func EOF ;
    
    func
        : FUNCTION_START IDENTIFIER LPAREN argumentList? RPAREN
        ;
    
    argument
       : func
       ;
    
    argumentList
       : config config_rest
       | no_config_rest
       ;
    
    config_rest
       : (ARG_SEPARATOR argument)*
       ;
    
    no_config_rest
       : argument (ARG_SEPARATOR argument)*
       ;
    
    config
       : json
       ;
    
    json
       : value
       ;
    
    obj
       : '{' pair (',' pair)* '}'
       | '{' '}'
       ;
    
    pair
       : STRING ':' value
       ;
    
    arr
       : '[' value (',' value)* ']'
       | '[' ']'
       ;
    
    value
       : STRING
       | NUMBER
       | obj
       | arr
       | 'true'
       | 'false'
       | 'null'
       ;
    
    
    STRING
       : '"' (ESC | SAFECODEPOINT)* '"'
       ;
    
    
    fragment ESC
       : '\\' (["\\/bfnrt] | UNICODE)
       ;
    fragment UNICODE
       : 'u' HEX HEX HEX HEX
       ;
    fragment HEX
       : [0-9a-fA-F]
       ;
    fragment SAFECODEPOINT
       : ~ ["\\\u0000-\u001F]
       ;
    
    
    NUMBER
       : '-'? INT ('.' [0-9] +)? EXP?
       ;
    
    
    fragment INT
       : '0' | [1-9] [0-9]*
       ;
    
    // no leading zeros
    
    fragment EXP
       : [Ee] [+\-]? INT
       ;
    
    // \- since - means "range" inside [...]
    
    WS
       : [ \t\n\r] + -> skip
       ; 
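    

    With this grammar, config and the arguments are children of argumentList (through config_rest and no_config_rest) rather than of func, so the visitor from the question has to look them up there. The following is only a sketch of that adjustment. It assumes the accessor names ANTLR derives from the rule names (config_rest(), no_config_rest()), the Instruction type from the question, and that JObject comes from Newtonsoft.Json.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

public class DslVisitor : JustDslBaseVisitor<Instruction>
{
    public override Instruction VisitFunc(JustDslParser.FuncContext context)
    {
        var name = context.IDENTIFIER().GetText();

        // argumentList is optional, so every lookup below is null-safe.
        var list = context.argumentList();
        var conf = list?.config()?.GetText();

        // Arguments sit under config_rest when a config came first,
        // and under no_config_rest otherwise.
        var arguments = list?.config_rest()?.argument()
            ?? list?.no_config_rest()?.argument()
            ?? Array.Empty<JustDslParser.ArgumentContext>();

        return new Instruction
        {
            Name = name,
            Config = conf == null ? null : JObject.Parse(conf),
            Bindings = arguments.Select(x => x.Accept(this)).ToList()
        };
    }

    public override Instruction VisitArgument(JustDslParser.ArgumentContext context)
    {
        return context.func().Accept(this);
    }
}
```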
    

    Mike was on the right track (but had a typo with the functionArgs rule). But without the input, this problem was difficult to solve.