Match anything until end tag (generic text) in simple lexer/parser using ANTLR4

I want to make a simple parser for a simple scripting language. It has text blocks and script blocks; inside those script blocks, I want to be able to define a function as well as execute generic statements of any kind.

I don't really need to know or care what classifies as a "statement", but I do need to parse function declarations. So even if something looks like a while loop and I don't have a rule for a while loop, can I match a "generic statement" rule and just get the content somehow?

Using a catch-all rule I am able to do the "generic text" part fine, but in script mode I'm less successful. I tried nested modes, where I set an 'IN FUNCTION' mode, but kept running into roadblocks.

For example, when inside a statement within my functionDeclaration, how can I match everything until the end function? Furthermore, how can I match a "generic" statement, so that I never need statement types like emptyStatement or assignmentStatement? Even if it just becomes one big "script code blob", that's fine with me.

Where I am so far:

My Grammar:

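parser grammar ExprParser;
options { tokenVocab=ExprLexer; }

// NOTE: the entry rule's name is missing from the post; `file` is a placeholder.
file
    : block* EOF
    ;
block
    : textBlock+
    | script
    ;
textBlock
    : HtmlDtd
    | GenericText
    | ScriptEnd
    ;
script
    : topStatement+
    | statement
    ;
topStatement
    : functionDeclaration
    ;
functionDeclaration
    : FunctionStart Ident L_PAREN R_PAREN statement* FunctionEnd
    ;
statement
    : assignmentStatement
    | emptyStatement
    ;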

My Lexer

lexer grammar ExprLexer;

channels { Comments, SkipChannel }

SeaWhitespace:  [ \t\r\n\f]+ -> channel(HIDDEN);
HtmlDtd:        '<!' .*? '>';
ScriptStart:       SCRIPT_START_FRAGMENT -> channel(SkipChannel), pushMode(SCRIPT);

// Catch all text
GenericText : . ; 

mode SCRIPT;
ScriptEnd :'%' '>' -> channel(SkipChannel), popMode;
ScriptWhitespace : [ \t\r\n\f]+ -> channel(SkipChannel);

// Comments begin with single quote
ScriptSingleLineComment:  '\'' -> channel(SkipChannel), pushMode(SingleLineCommentMode);
Ident : ID;

COMMA     : ',';
L_PAREN   : '(';
R_PAREN   : ')';
ASSIGNTO  : '=';

mode SingleLineCommentMode;
Comment:                 ~[\r\n?]+ -> channel(Comments);
CommentEnd:              [\r\n] -> channel(SkipChannel), popMode; // exit from comment.

// Fragments
fragment ID: [a-zA-Z0-9_\u0080-\ufffe]+;
fragment NameString: [a-zA-Z_\u0080-\ufffe][a-zA-Z0-9_\u0080-\ufffe]*;
fragment SCRIPT_START_FRAGMENT : '<%';
fragment SCRIPT_END_FRAGMENT : '%>';
fragment FUNCTION_START_FRAGMENT : 'function';
fragment FUNCTION_END_FRAGMENT : 'end function'; // Space is required here

Some test strings

<! tagsIknow >
function xxx() 'this is a comment 

  x = y;
  a = 1;

end function

a = 1;
b = 2;


  'another script
  x = 3; 'inline comment again

The kind of script I want to work with


function xxx() 
   while (true) ' notice I have no rule for a while loop
     get me everything in here verbatim except for comments ' this ideally is trimmed
end function ' I want everything until the 'end function' keyword, basically


more generic text


My goal is for input like this


arbitrary script lines1
arbitrary script lines2

function x(a,b) 
   arbitrary script body containing anything
end function

arbitrary script lines3 again
plain text
arbitrary script lines4 again

function y() 
    different function body
end function

So I get this:

SCRIPT_BLOB (matching script lines 1 & 2 together)
  name: x
  params: [a, b]
  body: SCRIPT_BLOB (containing the body)
SCRIPT_BLOB (matching line 3)
PLAIN_TEXT_BLOB (matching 'plain text')
SCRIPT_BLOB (matching line 4)
  name: y
  params: []
  body: SCRIPT_BLOB (containing the body)

So in theory there are just three "types": plain text, script objects (possibly multiple lines), and functions (which themselves contain some params and a single script object).

Given these objects, I can preserve the order in which I encountered them and handle each appropriately: pushing "PLAIN TEXT" out raw, running "non-function scripts" in order, and declaring functions in order.
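To make the target shape concrete, here is a minimal sketch in Python of the three object kinds and the in-order handling described above. The class names `PlainText`, `ScriptBlob` and `FunctionDecl` and the `process` helper are hypothetical, and "running" a script is only simulated by recording it.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class PlainText:
    text: str


@dataclass
class ScriptBlob:
    code: str


@dataclass
class FunctionDecl:
    name: str
    params: List[str]
    body: ScriptBlob


Node = Union[PlainText, ScriptBlob, FunctionDecl]


def process(nodes: List[Node]):
    """Walk the nodes in source order and return (output, executed, functions)."""
    output: List[str] = []                   # plain text pushed out raw
    executed: List[str] = []                 # stand-in for running non-function scripts
    functions: Dict[str, FunctionDecl] = {}  # declared functions, keyed by name
    for node in nodes:
        if isinstance(node, PlainText):
            output.append(node.text)
        elif isinstance(node, ScriptBlob):
            executed.append(node.code)
        else:
            functions[node.name] = node
    return output, executed, functions


# The desired result for the sample input, written out by hand.
nodes = [
    ScriptBlob("arbitrary script lines1\narbitrary script lines2"),
    FunctionDecl("x", ["a", "b"], ScriptBlob("arbitrary script body containing anything")),
    ScriptBlob("arbitrary script lines3 again"),
    PlainText("plain text"),
    ScriptBlob("arbitrary script lines4 again"),
    FunctionDecl("y", [], ScriptBlob("different function body")),
]
output, executed, functions = process(nodes)
```

Keeping everything in one ordered list is what preserves the interleaving of text, scripts and declarations.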

The problem is that I cannot seem to capture things like the function name or the parameters while I also have a greedy rule, because ANTLR's lexer prefers whichever rule matches the most input. So I cannot have a rule that checks that the parameters are valid identifiers and, at the same time, a '.+' rule to collect the function body.
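The effect can be reproduced outside ANTLR. The sketch below is a small Python emulation (not ANTLR itself) of the lexer's tie-breaking: the longest match wins, and on a tie the rule listed first wins. The rule names `Ident` and `AllText` are stand-ins chosen for illustration.

```python
import re

# Hypothetical stand-ins for two competing lexer rules, in grammar order.
RULES = [
    ("Ident", re.compile(r"[A-Za-z0-9_]+")),
    ("AllText", re.compile(r"(?s).+")),  # greedy catch-all, like '.+'
]


def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the longest match at pos; ties go to the earlier rule."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(next_token("xxx(a, b)"))  # ('AllText', 'xxx(a, b)')
print(next_token("xxx"))        # ('Ident', 'xxx')
```

As soon as anything follows the identifier, the catch-all swallows the whole input, which is why the name and parameters never surface as their own tokens.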

A compromise would be to collect the function as a whole (everything between function and end function) and run a second parse on that block to extract the function header (name + params), but I am trying to avoid that.
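For reference, that second pass could be as small as one regular expression. This is only a sketch in Python under stated assumptions: the captured block has the shape `function NAME(PARAMS) BODY end function`, comments have already been removed, and the helper name `parse_function_block` is made up for illustration.

```python
import re

# Assumed shape of a captured block: 'function NAME(PARAMS) BODY end function'.
FUNCTION_BLOCK = re.compile(
    r"\s*function\s+(?P<name>\w+)\s*\((?P<params>[^)]*)\)"
    r"(?P<body>.*?)end\s+function\s*$",
    re.DOTALL,
)


def parse_function_block(block):
    """Split a whole function block into (name, params, body)."""
    m = FUNCTION_BLOCK.match(block)
    if m is None:
        raise ValueError("not a function block")
    # An empty parameter list yields [''] after split, so drop blank entries.
    params = [p.strip() for p in m.group("params").split(",") if p.strip()]
    return m.group("name"), params, m.group("body").strip()


block = "function x(a,b) \n   arbitrary script body containing anything\nend function"
print(parse_function_block(block))
```

The cost of this route is that the header syntax ends up defined twice, once in the grammar and once in the regular expression.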

Another idea would be to have an additional mode which goes into "FUNCTION_BODY_MODE" once it encounters an R_PAREN, and pop out (twice) once it finds end function. This way, anything between R_PAREN and end function is the function's body, inside that higher level mode I can have a greedy rule.

Something like

FunctionStart:       FUNCTION_START_FRAGMENT-> channel(SkipChannel), pushMode(IN_FUNCTION);

FunctionBodyStart:       R_PAREN_FRAGMENT -> channel(SkipChannel), pushMode(IN_FUNCTION_BODY);

FunctionBodyAndFunctionEnd : FUNCTION_END_FRAGMENT -> channel(SkipChannel), popMode, popMode; // double pop
ALL_TEXT : . ; // will consume everything

My issue with the above is that it sounds extremely counter-intuitive. I am very new to ANTLR parsers, so I am just looking for the best advice on an approach that fits my purposes.


  • Instead of pushing modes, I'd just use mode(...) to switch to another mode. This means you need not pop modes, making it a bit easier to understand what's going on.

    I'd go for something like this:


    lexer grammar ExprLexer;
    ScriptStart : '<%' -> mode(Script);
    GenericText : . ;
    fragment Spaces : [ \r\n\t]+;
    fragment Id     : [a-zA-Z0-9_\u0080-\ufffe]+;
    mode Script;
     ScriptEnd  : '%>' -> mode(DEFAULT_MODE);
     Comment    : '\'' ~[\r\n]* -> skip;
     Function   : 'function' -> mode(FunctionDeclaration);
     ScriptText : . ;
    mode FunctionDeclaration;
     FunctionName      : Id;
     DeclarationSpaces : Spaces+ -> skip;
     OPar              : '(' -> mode(FunctionParameter);
    mode FunctionParameter;
     ParameterName   : Id;
     ParameterSpaces : Spaces+ -> skip;
     Comma           : ',';
     CPar            : ')' -> mode(InFunction);
    mode InFunction;
     EndFunction    : 'end' Spaces 'function' -> mode(Script);
     FunctionSpaces : Spaces+ -> skip;
     FunctionText   : . ;


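    parser grammar ExprParser;
    options { tokenVocab=ExprLexer; }
    // NOTE: the entry rule's name is missing from the post; `file` is a placeholder.
    file
     : block* EOF ;
    block
     : plainText
     | ScriptStart script* ScriptEnd ;
    plainText
     : GenericText+ ;
    script
     : ScriptText+
     | function ;
    function
     : Function FunctionName OPar parameters? CPar functionBody EndFunction ;
    functionBody
     : FunctionText* ;
    parameters
     : ParameterName ( Comma ParameterName )* ;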

    which will parse your input:

    text1
    <%
    arbitrary script lines1
    arbitrary script lines2

    function x(a,b)
       arbitrary script body containing anything
    end function

    arbitrary script lines3 again
    %>
    plain text
    <%
    arbitrary script lines4 again

    function y()
        different function body
    end function
    %>
    MU

    like this:

        (plainText t e x t 1 \n)) 
      (block <% 
        (script \n a r b i t r a r y   s c r i p t   l i n e s 1 \n a r b i t r a r y   s c r i p t   l i n e s 2 \n \n) 
          (function function x ( (parameters a , b) ) 
            (functionBody a r b i t r a r y s c r i p t b o d y c o n t a i n i n g a n y t h i n g) end function)) 
        (script \n \n a r b i t r a r y   s c r i p t   l i n e s 3   a g a i n \n) %>) 
        (plainText \n p l a i n   t e x t \n)) 
      (block <% 
        (script \n a r b i t r a r y   s c r i p t   l i n e s 4   a g a i n \n \n) 
          (function function y ( ) 
            (functionBody d i f f e r e n t f u n c t i o n b o d y) end function)) 
        (script \n) %>) 
        (plainText \n M U)) <EOF>)
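
To see why switching modes is enough, the lexer above can be emulated with a small table-driven scanner. The following is a rough Python sketch, not ANTLR output: the `MODES` table copies the rule names and patterns from the lexer grammar, while `tokenize`, `group` and the sample string are hypothetical helpers, and the sample assumes the `<% ... %>` delimiters the lexer expects.

```python
import re

SPACES = r"[ \r\n\t]+"
ID = r"[a-zA-Z0-9_\u0080-\ufffe]+"
# Per mode: (token name, regex, mode to switch to or None, skip?).
MODES = {
    "DEFAULT": [("ScriptStart", r"<%", "Script", False),
                ("GenericText", r"(?s).", None, False)],
    "Script": [("ScriptEnd", r"%>", "DEFAULT", False),
               ("Comment", r"'[^\r\n]*", None, True),
               ("Function", r"function", "FunctionDeclaration", False),
               ("ScriptText", r"(?s).", None, False)],
    "FunctionDeclaration": [("FunctionName", ID, None, False),
                            ("DeclarationSpaces", SPACES, None, True),
                            ("OPar", r"\(", "FunctionParameter", False)],
    "FunctionParameter": [("ParameterName", ID, None, False),
                          ("ParameterSpaces", SPACES, None, True),
                          ("Comma", r",", None, False),
                          ("CPar", r"\)", "InFunction", False)],
    "InFunction": [("EndFunction", r"end[ \r\n\t]+function", "Script", False),
                   ("FunctionSpaces", SPACES, None, True),
                   ("FunctionText", r"(?s).", None, False)],
}


def tokenize(source):
    """Longest match wins, ties go to the first rule; a rule may switch the mode."""
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(source):
        best = None
        for name, pattern, next_mode, skip in MODES[mode]:
            m = re.compile(pattern).match(source, pos)
            if m and (best is None or m.end() > best[0].end()):
                best = (m, name, next_mode, skip)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode}")
        m, name, next_mode, skip = best
        if not skip:
            tokens.append((name, m.group()))
        pos, mode = m.end(), next_mode or mode
    return tokens


def group(tokens):
    """Fold tokens into ('text' | 'script' | 'function', payload) items, in order."""
    kinds = {"GenericText": "text", "ScriptText": "script"}
    items, i = [], 0
    while i < len(tokens):
        name = tokens[i][0]
        j = i + 1
        if name in kinds:  # merge a run of one-character tokens into one blob
            while j < len(tokens) and tokens[j][0] == name:
                j += 1
            blob = "".join(v for _, v in tokens[i:j]).strip()
            if blob:
                items.append((kinds[name], blob))
        elif name == "Function":  # collect everything up to EndFunction
            while j < len(tokens) and tokens[j][0] != "EndFunction":
                j += 1
            part = tokens[i + 1:j]
            fn_name = next(v for n, v in part if n == "FunctionName")
            params = [v for n, v in part if n == "ParameterName"]
            body = "".join(v for n, v in part if n == "FunctionText")
            items.append(("function", (fn_name, params, body)))
            j += 1
        i = j  # ScriptStart / ScriptEnd carry no payload
    return items


SAMPLE = ("text1\n<%\narbitrary script lines1\narbitrary script lines2\n\n"
          "function x(a,b)\n   arbitrary script body containing anything\n"
          "end function\n\narbitrary script lines3 again\n%>\nplain text\n<%\n"
          "arbitrary script lines4 again\n\nfunction y()\n"
          "    different function body\nend function\n%>\nMU")
for item in group(tokenize(SAMPLE)):
    print(item)
```

Because `FunctionSpaces` is skipped, the function bodies come out without whitespace, just as in the parse tree. Removing that rule should leave whitespace to the catch-all `FunctionText`, since `EndFunction` still wins as the longer match.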