java parsing antlr antlr4

Customized grammar similar to JSON


I'm trying to create a grammar that is similar to JSON but not identical. The data looks like this:

{foo=123,bar=abc}

Basically, names and strings have no double quotes, and an equals sign replaces the colon as the key-value separator. My grammar is based on the ANTLR v4 JSON grammar, with the following modifications:

pair
   : STRING '=' value
   ;

value
   : STRING
   |  NUMBER ....;

STRING
    : (ESC | SAFECODEPOINT)+
    ;

But when I parse the data snippet above, the parser treats the whole string as a single value instead of breaking it into tokens. I think the problem is the STRING definition. How can I fix the STRING token definition?


Solution

  • The problem is that your STRING rule matches almost any character, including '{', '=', ',' and '}'. Because the lexer always takes the longest possible match, the entire input, braces included, is consumed as a single STRING token.

    One way to fix that is to limit what can appear in a STRING token (used here for both names and string values), e.g. to alphabetic characters only. The ANTLR grammar shown below does exactly that. For your sample input:

    {foo=123,bar=abc}
    

    it produces a successful parse tree.

    Full grammar:

    grammar JSON;
    
    json
       : value EOF
       ;
    
    obj
       : '{' pair (',' pair)* '}'
       | '{' '}'
       ;
    
    pair
       : STRING '=' value
       ;
    
    value
       : STRING
       | NUMBER
       | obj
       ;
    
    // Unquoted names and string values: letters only, so '{', '}', '=' and ',' end the token
    STRING
       : [a-zA-Z]+
       ;
    
    fragment ESC
       : '\\' (["\\/bfnrt] | UNICODE)
       ;
    
    fragment UNICODE
       : 'u' HEX HEX HEX HEX
       ;
    
    fragment HEX
       : [0-9a-fA-F]
       ;
    
    fragment SAFECODEPOINT
       : ~ ["\\\u0000-\u001F]
       ;
    
    NUMBER
       : '-'? INT ('.' [0-9] +)? EXP?
       ;
    
    fragment INT
       : '0' | [1-9] [0-9]*
       ;
    
    fragment EXP
       : [Ee] [+\-]? [0-9]+
       ;
    
    WS
       : [ \t\n\r] + -> skip
       ;
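
To see why the original STRING rule swallowed the whole input, the following plain-Java sketch emulates the lexer's longest-match behaviour with java.util.regex. It is not ANTLR itself: the class name MaximalMunchDemo, the tokenize helper and the two regex constants are hypothetical, and greedy regex matching is only assumed to stand in for ANTLR's longest match, which holds for these simple patterns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaximalMunchDemo {

    // Regex equivalent of the question's rule: (ESC | SAFECODEPOINT)+
    static final String ORIGINAL_STRING =
        "(\\\\([\"\\\\/bfnrt]|u[0-9a-fA-F]{4})|[^\"\\\\\\x00-\\x1F])+";

    // Regex equivalent of the answer's rule: [a-zA-Z]+
    static final String FIXED_STRING = "[a-zA-Z]+";

    // Longest match wins; on a tie the rule listed first wins.
    // Literal tokens are listed first, then STRING, NUMBER and WS.
    static List<String> tokenize(String input, String stringRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        for (String literal : new String[] {"{", "}", "=", ","}) {
            // Names starting with a quote mark literal tokens.
            rules.put("'" + literal + "'", Pattern.compile(Pattern.quote(literal)));
        }
        rules.put("STRING", Pattern.compile(stringRegex));
        rules.put("NUMBER",
            Pattern.compile("-?(0|[1-9][0-9]*)(\\.[0-9]+)?([Ee][+-]?[0-9]+)?"));
        rules.put("WS", Pattern.compile("[ \\t\\n\\r]+"));

        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only accepts a match starting exactly at pos.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            String text = input.substring(pos, bestEnd);
            if (bestName.startsWith("'")) {
                tokens.add(text);                        // literal token
            } else if (!bestName.equals("WS")) {         // whitespace is skipped
                tokens.add(bestName + "(" + text + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "{foo=123,bar=abc}";
        // prints [STRING({foo=123,bar=abc})]
        System.out.println(tokenize(input, ORIGINAL_STRING));
        // prints [{, STRING(foo), =, NUMBER(123), ,, STRING(bar), =, STRING(abc), }]
        System.out.println(tokenize(input, FIXED_STRING));
    }
}
```

With the permissive rule the literal '{' matches one character at offset 0, but STRING matches all seventeen, so it wins. With the letters-only rule, STRING stops at the first '=' or ',' and the punctuation tokens get their turn.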