
Is it possible to define a nestable block in an ANTLR4 grammar where the begin and end tags are the same token?


I would like to define a grammar where the start of each block is a tagname starting with the introducer character %. The difference between the start and end is that the start would have parameters within parentheses and the end does not. For example:

%table(parameter-list)
table,data,goes,here
%table

I can write the production in ANTLR like this:

table: '%table' paramlist .*? '%table' ;
paramlist: '(' param? (',' param)* ')' ;

but while this works, it would wrongly treat the start of a second table as the end of the first if the first one is missing its end tag:

%table()
stuff,goes,here
%table()
second,table
%table

Since I want tables to nest, the above example is ambiguous. The start of the second table could be the end of the first, but it is not because of the parentheses.

Is it possible in ANTLR to define the end tag as '%table' which is NOT FOLLOWED BY '('? Or must I write a grammar with a different tag for the end?

I could require that the end tag be on a line by itself, but it seems weird to arbitrarily require a newline:

table: '%table' paramlist .*? '%table' '\n' ;

Is there any clean way to do this?


Solution

  • table: '%table' paramlist .*? '%table' ;
    

    FYI: .* inside a parser rule does not match zero or more characters but zero or more tokens. Perhaps you are aware of this, perhaps not.

    paramlist: '(' param? (',' param)* ')' ;
    

    param? (',' param)* would allow , a, b to successfully match. You probably want: (param (',' param)* )?
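
The difference between the two forms can be checked outside ANTLR. The following is a minimal sketch in Python, not part of the grammar: it assumes every parameter is abbreviated to the single letter `p` and models each rule as a regular expression. The names `ORIGINAL`, `CORRECTED` and `accepts` are made up for this illustration.

```python
import re

# Each parameter is abbreviated to the letter "p"; spaces are ignored.
ORIGINAL = re.compile(r"\(p?(,p)*\)")      # '(' param? (',' param)* ')'
CORRECTED = re.compile(r"\((p(,p)*)?\)")   # '(' (param (',' param)*)? ')'


def accepts(pattern, text):
    """Return True if the whole text matches the given rule model."""
    return pattern.fullmatch(text.replace(" ", "")) is not None
```

With these definitions, `accepts(ORIGINAL, "(, p, p)")` is true because the optional first parameter can be skipped while the comma-prefixed ones still match, whereas `accepts(CORRECTED, "(, p, p)")` is false. Both forms accept `()`, `(p)` and `(p, p)`.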

    As for your question: let the inner part of a table not match the token %table, but only let it match an entire %table() ... %table (so let it match recursively).

    Here's a quick demo:

    grammar Table;
    
    parse
     : table EOF
     ;
    
    table
     : TABLE params table_atom* TABLE
     ;
    
    params
     : '(' ( ID ( ',' ID )* )? ')'
     ;
    
    table_atom
     : ~TABLE // match any token other than `TABLE`
     | table
     ;
    
    TABLE
     : '%table'
     ;
    
    ID
     : [a-zA-Z]+
     ;
    
    SPACE
     : [ \t\r\n] -> skip
     ;
    

    which will parse the input:

    %table(a,b)
      table,data
      %table()
        x,y,z
      %table
      goes,here
    %table
    

    as follows:

    (image not reproduced: parse tree of the input above)
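
To make the nesting concrete without generating an ANTLR parser, here is a hand-written recursive-descent sketch in Python that mirrors the rules `table`, `params` and `table_atom` from the demo grammar. It is an illustration under stated assumptions, not what ANTLR generates: the function names and the dictionary-shaped tree are invented here, and it decides between "nested table" and "end tag" with a single token of lookahead (`%table` followed by `(` opens a table), whereas ANTLR's prediction may look further ahead.

```python
import re

# Token shapes follow the demo lexer: TABLE, ID, and the literals ( ) ,
TOKEN = re.compile(r"\s*(%table|[A-Za-z]+|[(),])")


def tokenize(text):
    """Split the input into tokens, skipping whitespace like the SPACE rule."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_params(tokens, i):
    """params: '(' ( ID ( ',' ID )* )? ')'  ->  (names, next_index)"""
    def expect(tok):
        nonlocal i
        if i >= len(tokens) or tokens[i] != tok:
            raise SyntaxError(f"expected {tok!r}")
        i += 1

    names = []
    expect("(")
    if i < len(tokens) and tokens[i].isalpha():
        names.append(tokens[i])
        i += 1
        while i < len(tokens) and tokens[i] == ",":
            i += 1
            if i >= len(tokens) or not tokens[i].isalpha():
                raise SyntaxError("expected ID after ','")
            names.append(tokens[i])
            i += 1
    expect(")")
    return names, i


def parse_table(tokens, i=0):
    """table: TABLE params table_atom* TABLE  ->  (tree, next_index)"""
    if i >= len(tokens) or tokens[i] != "%table":
        raise SyntaxError("expected %table")
    params, i = parse_params(tokens, i + 1)
    body = []
    while True:
        if i >= len(tokens):
            raise SyntaxError("missing closing %table")
        if tokens[i] == "%table":
            # Assumption: one token of lookahead. '%table' followed by '('
            # starts a nested table; any other '%table' closes this one.
            if i + 1 < len(tokens) and tokens[i + 1] == "(":
                child, i = parse_table(tokens, i)
                body.append(child)
            else:
                return {"params": params, "body": body}, i + 1
        else:
            body.append(tokens[i])  # ~TABLE: any other token
            i += 1


def parse(text):
    """parse: table EOF"""
    tokens = tokenize(text)
    tree, i = parse_table(tokens)
    if i != len(tokens):
        raise SyntaxError("unexpected input after closing %table")
    return tree
```

Calling `parse` on the sample input above yields an outer table with parameters `a` and `b` whose body contains the inner table as a single nested element. On the question's example with the missing end tag, the second `%table()` is read as the start of a nested table, so the error surfaces as a missing closing `%table` for the outer block and not as a silently truncated first table.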