Match anything until end tag (generic text) in simple lexer/parser using ANTLR4


I want to make a simple parser for a simple scripting language. It has text blocks and script blocks; inside the script blocks I want to be able to define functions as well as execute generic statements of any kind.

I don't really need to know or care what counts as a "statement", but I do need to parse function declarations. So even if something looks like a while loop and I have no rule for a while loop, can I match a "generic statement" rule and just get its content somehow?

Using a catch-all rule I can handle the "generic text" part fine, but in script mode I'm less successful. I tried nested modes, where I set an 'IN FUNCTION' mode, but kept running into roadblocks.

For example, when inside a statement within my functionDeclaration, how can I match everything until the end function? Furthermore, how can I match a "generic" statement, so that I never need statement types like emptyStatement or assignmentStatement? Even if it just becomes one big "script code blob", that's fine with me.

Where I am so far:

My Grammar:

parser grammar ExprParser;
options { tokenVocab=ExprLexer; }

file
    : block* EOF
    ;
    
block
    : textBlock+
    | script
    ;
    
textBlock
    : HtmlDtd
    | GenericText
    | ScriptEnd
    ;
    
script
    : topStatement+
    | statement
    ;
    
topStatement
    : functionDeclaration
    ;

functionDeclaration
    : FunctionStart Ident L_PAREN R_PAREN statement* FunctionEnd
    ;


statement
    : assignmentStatement
    | emptyStatement
    ;
    
assignmentStatement
    : Ident ASSIGNTO Ident SEMICOLON
    ;
    
emptyStatement
    : SEMICOLON
    ;

My Lexer

lexer grammar ExprLexer;

channels { Comments, SkipChannel }



SeaWhitespace:  [ \t\r\n\f]+ -> channel(HIDDEN);
HtmlDtd:        '<!' .*? '>';
ScriptStart:       SCRIPT_START_FRAGMENT -> channel(SkipChannel), pushMode(SCRIPT);

// Catch all text
GenericText : . ; 

mode SCRIPT;
ScriptEnd :'%' '>' -> channel(SkipChannel), popMode;
ScriptWhitespace : [ \t\r\n\f]+ -> channel(SkipChannel);

// Comments begin with single quote
ScriptSingleLineComment:  '\'' -> channel(SkipChannel), pushMode(SingleLineCommentMode);
    
FunctionStart :  FUNCTION_START_FRAGMENT;
FunctionEnd : FUNCTION_END_FRAGMENT;
Ident : ID;

COMMA     : ',';
SEMICOLON : ';';
L_PAREN   : '(';
R_PAREN   : ')';
ASSIGNTO  : '=';

mode SingleLineCommentMode;
Comment:                 ~[\r\n?]+ -> channel(Comments);
CommentEnd:              [\r\n] -> channel(SkipChannel), popMode; // exit from comment.


// Fragments
fragment ID: [a-zA-Z0-9_\u0080-\ufffe]+;
fragment NameString: [a-zA-Z_\u0080-\ufffe][a-zA-Z0-9_\u0080-\ufffe]*;
fragment SCRIPT_START_FRAGMENT : '<%';
fragment SCRIPT_END_FRAGMENT : '%>';
fragment FUNCTION_START_FRAGMENT : 'function';
fragment FUNCTION_END_FRAGMENT : 'end function'; // Space is required here

Some test strings

<! tagsIknow >
<tagsIdontknowbutwant>
<%
function xxx() 'this is a comment 

  x = y;
  a = 1;
  ;
  ;

end function

a = 1;
b = 2;

%>
randomtext
<%

  'another script
  x = 3; 'inline comment again
%>

The kind of script I want to work with

blah
<%

function xxx() 
   while (true) ' notice I have no rule for a while loop
     get me everything in here verbatim except for comments ' this ideally is trimmed
   endwhile 
end function ' I want everything until the 'end function' keyword, basically

%>

more generic text

EDIT:

My goal is for input like this

text1
<%

arbitrary script lines1
arbitrary script lines2


function x(a,b) 
   arbitrary script body containing anything
end function

arbitrary script lines3 again
%>
plain text
<%
arbitrary script lines4 again

function y() 
    different function body
end function
%>

So I get this:

PLAIN_TEXT_BLOB (matching TEXT1)
SCRIPT_BLOB (matching script lines 1 & 2 together)
FUNCTION
  name: x
  params: [a, b]
  body: SCRIPT_BLOB (containing the body)
SCRIPT_BLOB (matching line 3)
PLAIN_TEXT_BLOB (matching 'plain text')
SCRIPT_BLOB (matching line 4)
FUNCTION
  name: y
  params: []
  body: SCRIPT_BLOB (containing the body)
EOF

So in theory there are just three "types": plain texts, script objects (multiple lines), and functions (which themselves contain some params and a single script object).

That way, given the objects above, I can preserve the order in which I encountered them and handle each appropriately: pushing "PLAIN TEXT" out raw, running "non-function scripts" in order, and declaring functions in order.

The problem is that I cannot seem to capture things like the function name or the parameters while I also have a greedy rule, because the ANTLR lexer prefers whichever rule matches the most input. So I cannot have a parameter rule that checks for identifiers alongside a '.+' rule that collects the function body.

A compromise would be to collect the function as a whole (everything between function and end function) and do a second parse on that block to extract the function header (name + params), but I'm trying to avoid that.

Another idea would be to have an additional mode, "FUNCTION_BODY_MODE", which is entered once an R_PAREN is encountered and popped (twice) once end function is found. That way, anything between R_PAREN and end function is the function's body, and inside that mode I can have a greedy rule.

Something like

FunctionStart:       FUNCTION_START_FRAGMENT-> channel(SkipChannel), pushMode(IN_FUNCTION);

mode IN_FUNCTION;
FunctionBodyStart:       R_PAREN_FRAGMENT -> channel(SkipChannel), pushMode(IN_FUNCTION_BODY);

mode IN_FUNCTION_BODY;
FunctionBodyAndFunctionEnd : FUNCTION_END_FRAGMENT -> channel(SkipChannel), popMode, popMode; // double pop
ALL_TEXT : . ; // will consume everything

My issue with the above is that it sounds extremely counter-intuitive. I am very new to ANTLR parsers, so I'm just looking for the best advice on what fits my purposes.


Solution

  • Instead of pushing modes, I'd just use mode(...) to switch directly to another mode. Then there is nothing to pop, which makes it a bit easier to follow what's going on.

    I'd go for something like this:

    ExprLexer.g4

    lexer grammar ExprLexer;
    
    ScriptStart : '<%' -> mode(Script);
    GenericText : . ;
    
    fragment Spaces : [ \r\n\t]+;
    fragment Id     : [a-zA-Z0-9_\u0080-\ufffe]+;
    
    mode Script;
    
     ScriptEnd  : '%>' -> mode(DEFAULT_MODE);
     Comment    : '\'' ~[\r\n]* -> skip;
     Function   : 'function' -> mode(FunctionDeclaration);
     ScriptText : . ;
    
    mode FunctionDeclaration;
    
     FunctionName      : Id;
     DeclarationSpaces : Spaces+ -> skip;
     OPar              : '(' -> mode(FunctionParameter);
    
    mode FunctionParameter;
    
     ParameterName   : Id;
     ParameterSpaces : Spaces+ -> skip;
     Comma           : ',';
     CPar            : ')' -> mode(InFunction);
    
    mode InFunction;
    
     EndFunction    : 'end' Spaces 'function' -> mode(Script);
     FunctionSpaces : Spaces+ -> skip;
     FunctionText   : . ;
    

    ExprParser.g4

    parser grammar ExprParser;
    
    options { tokenVocab=ExprLexer; }
    
    file
     : block* EOF
     ;
    
    block
     : plainText
     | ScriptStart script* ScriptEnd
     ;
    
    plainText
     : GenericText+
     ;
    
    script
     : ScriptText+
     | function
     ;
    
    function
     : Function FunctionName OPar parameters? CPar functionBody EndFunction
     ;
    
    functionBody
     : FunctionText*
     ;
    
    parameters
     : ParameterName ( Comma ParameterName )*
     ;
    

    which will parse your input:

    text1
    <%
    arbitrary script lines1
    arbitrary script lines2
    
    function x(a,b)
       arbitrary script body containing anything
    end function
    
    arbitrary script lines3 again
    %>
    plain text
    <%
    arbitrary script lines4 again
    
    function y()
        different function body
    end function
    %>
    MU
    

    like this:

    (file 
      (block 
        (plainText t e x t 1 \n)) 
      (block <% 
        (script \n a r b i t r a r y   s c r i p t   l i n e s 1 \n a r b i t r a r y   s c r i p t   l i n e s 2 \n \n) 
        (script 
          (function function x ( (parameters a , b) ) 
            (functionBody a r b i t r a r y s c r i p t b o d y c o n t a i n i n g a n y t h i n g) end function)) 
        (script \n \n a r b i t r a r y   s c r i p t   l i n e s 3   a g a i n \n) %>) 
      (block 
        (plainText \n p l a i n   t e x t \n)) 
      (block <% 
        (script \n a r b i t r a r y   s c r i p t   l i n e s 4   a g a i n \n \n) 
        (script 
          (function function y ( ) 
            (functionBody d i f f e r e n t f u n c t i o n b o d y) end function)) 
        (script \n) %>) 
      (block 
        (plainText \n M U)) <EOF>)
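
To make the mode transitions above concrete, and to show how the one-character tokens could be merged into the PLAIN_TEXT_BLOB / SCRIPT_BLOB / FUNCTION items the question asks for, here is a minimal hand-written Python sketch. It is not ANTLR output and does not use the ANTLR runtime: MODES, tokenize and parse are hypothetical names that only mimic the modes of ExprLexer.g4. Two simplifications are assumed: each mode tries its rules in order and takes the first match (ANTLR takes the longest match and uses rule order only to break ties), and whitespace inside a function body is kept rather than skipped.

```python
import re

ID = r"[A-Za-z0-9_\u0080-\ufffe]+"
WS = r"[ \r\n\t]+"

# Per mode: (token name, pattern, mode to switch to or None).
# The catch-all single-character rule is listed last in each mode that has one.
MODES = {
    "DEFAULT": [("ScriptStart", r"<%", "Script"),
                ("GenericText", r".", None)],
    "Script": [("ScriptEnd", r"%>", "DEFAULT"),
               ("Comment", r"'[^\r\n]*", None),
               ("Function", r"function", "FunctionDeclaration"),
               ("ScriptText", r".", None)],
    "FunctionDeclaration": [("FunctionName", ID, None),
                            ("Spaces", WS, None),
                            ("OPar", r"\(", "FunctionParameter")],
    "FunctionParameter": [("ParameterName", ID, None),
                          ("Spaces", WS, None),
                          ("Comma", r",", None),
                          ("CPar", r"\)", "InFunction")],
    "InFunction": [("EndFunction", r"end" + WS + r"function", "Script"),
                   ("FunctionText", r".", None)],
}

def tokenize(text):
    """Yield (token name, matched text), switching modes like mode(...)."""
    mode, pos = "DEFAULT", 0
    while pos < len(text):
        for name, pattern, next_mode in MODES[mode]:
            # DOTALL lets the catch-all '.' match newlines as well.
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            if m:
                yield name, m.group()
                pos = m.end()
                mode = next_mode or mode
                break
        else:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode}")

def parse(text):
    """Merge runs of single-character tokens into ordered blobs."""
    items, func, prev = [], None, None
    for name, value in tokenize(text):
        if name in ("GenericText", "ScriptText"):
            kind = "text" if name == "GenericText" else "script"
            if prev == name:  # continue the current blob
                items[-1] = (kind, items[-1][1] + value)
            else:             # start a new blob
                items.append((kind, value))
        elif name == "Function":
            func = {"name": None, "params": [], "body": ""}
        elif name == "FunctionName":
            func["name"] = value
        elif name == "ParameterName":
            func["params"].append(value)
        elif name == "FunctionText":
            func["body"] += value
        elif name == "EndFunction":
            items.append(("function", func["name"], func["params"],
                          func["body"].strip()))
        if name != "Comment":  # comments are dropped and do not split a blob
            prev = name
    return items

sample = (
    "text1\n"
    "<%\n"
    "line1 ' note\n"
    "function x(a,b)\n"
    "   body here\n"
    "end function\n"
    "line3\n"
    "%>\n"
    "plain"
)

for item in parse(sample):
    print(item)
```

For the sample above, parse yields a text item, a script item, a function item named x with parameters a and b, another script item, and a final text item, in that order. The same merging could instead be done in a listener or visitor over the ANTLR parse tree by concatenating the text of each plainText, script and functionBody node.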