I want to write a simple parser for a simple scripting language. The input has text blocks and script blocks; inside the script blocks I want to be able to define functions as well as execute generic statements of any kind.
I don't really need to know or care what counts as a "statement", but I do need to parse function declarations. So even if something looks like a while loop and I have no rule for while loops, can I match a "generic statement" rule and just get the content somehow?
Using a catch-all rule I can handle the "generic text" part fine, but in script mode I'm less successful. I tried nested modes, setting an 'IN FUNCTION' mode, but kept running into roadblocks.
For example, when inside a statement within my functionDeclaration, how can I match everything until the end function? Furthermore, how can I match a "generic" statement, so that I never need statement types like emptyStatement or assignmentStatement? Even if it just becomes a big "script code blob", that's fine with me.
Where I am so far:
My Grammar:
parser grammar ExprParser;
options { tokenVocab=ExprLexer; }
file
: block* EOF
;
block
: textBlock+
| script
;
textBlock
: HtmlDtd
| GenericText
| ScriptEnd
;
script
: topStatement+
| statement
;
topStatement
: functionDeclaration
;
functionDeclaration
: FunctionStart Ident L_PAREN R_PAREN statement* FunctionEnd
;
statement
: assignmentStatement
| emptyStatement
;
assignmentStatement
: Ident ASSIGNTO Ident SEMICOLON
;
emptyStatement
: SEMICOLON
;
My Lexer
lexer grammar ExprLexer;
channels { Comments, SkipChannel }
SeaWhitespace: [ \t\r\n\f]+ -> channel(HIDDEN);
HtmlDtd: '<!' .*? '>';
ScriptStart: SCRIPT_START_FRAGMENT -> channel(SkipChannel), pushMode(SCRIPT);
// Catch all text
GenericText : . ;
mode SCRIPT;
ScriptEnd :'%' '>' -> channel(SkipChannel), popMode;
ScriptWhitespace : [ \t\r\n\f]+ -> channel(SkipChannel);
// Comments begin with single quote
ScriptSingleLineComment: '\'' -> channel(SkipChannel), pushMode(SingleLineCommentMode);
FunctionStart : FUNCTION_START_FRAGMENT;
FunctionEnd : FUNCTION_END_FRAGMENT;
Ident : ID;
COMMA : ',';
SEMICOLON : ';';
L_PAREN : '(';
R_PAREN : ')';
ASSIGNTO : '=';
mode SingleLineCommentMode;
Comment: ~[\r\n?]+ -> channel(Comments);
CommentEnd: [\r\n] -> channel(SkipChannel), popMode; // exit from comment.
// Fragments
fragment ID: [a-zA-Z0-9_\u0080-\ufffe]+;
fragment NameString: [a-zA-Z_\u0080-\ufffe][a-zA-Z0-9_\u0080-\ufffe]*;
fragment SCRIPT_START_FRAGMENT : '<%';
fragment SCRIPT_END_FRAGMENT : '%>';
fragment FUNCTION_START_FRAGMENT : 'function';
fragment FUNCTION_END_FRAGMENT : 'end function'; // Space is required here
Some test strings
<! tagsIknow >
<tagsIdontknowbutwant>
<%
function xxx() 'this is a comment
x = y;
a = 1;
;
;
end function
a = 1;
b = 2;
%>
randomtext
<%
'another script
x = 3; 'inline comment again
%>
The kind of script I want to work with
blah
<%
function xxx()
while (true) ' notice I have no rule for a while loop
get me everything in here verbatim except for comments ' this ideally is trimmed
endwhile
end function ' I want everything until the 'end function' keyword, basically
%>
more generic text
EDIT:
My goal is for input like this
text1
<%
arbitrary script lines1
arbitrary script lines2
function x(a,b)
arbitrary script body containing anything
end function
arbitrary script lines3 again
%>
plain text
<%
arbitrary script lines4 again
function y()
different function body
end function
%>
So I get this:
PLAIN_TEXT_BLOB (matching TEXT1)
SCRIPT_BLOB (matching script lines 1 & 2 together)
FUNCTION
    name: x
    params: [a, b]
    body: SCRIPT_BLOB (containing the body)
SCRIPT_BLOB (matching line 3)
PLAIN_TEXT_BLOB (matching 'plain text')
SCRIPT_BLOB (matching line 4)
FUNCTION
    name: y
    params: []
    body: SCRIPT_BLOB (containing the body)
EOF
So in theory there are just three "types": plain texts, script objects (possibly multiple lines), and functions (which themselves contain some params and a single script object).
Given those objects, I can preserve the order in which I encountered them and handle each appropriately: pushing "PLAIN TEXT" out raw, running "non-function scripts" in order, and declaring functions in order.
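To make the target shape concrete, here is a minimal sketch of those three types in Python (not ANTLR). The class names PlainText, ScriptBlob and Function and the process helper are hypothetical, invented only for this illustration; process just records what would be done with each node rather than actually running anything.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class PlainText:
    text: str


@dataclass
class ScriptBlob:
    code: str


@dataclass
class Function:
    name: str
    params: List[str]
    body: ScriptBlob


Node = Union[PlainText, ScriptBlob, Function]


def process(nodes: List[Node]) -> list:
    """Walk the nodes in source order and record the action for each one."""
    log = []
    for node in nodes:
        if isinstance(node, PlainText):
            log.append(("emit", node.text))  # push plain text out raw
        elif isinstance(node, Function):
            log.append(("declare", node.name, tuple(node.params)))
        else:
            log.append(("run", node.code))  # non-function script blob
    return log
```

Because the nodes sit in one flat list, the order in which they were encountered is preserved for free.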
The problem is that I cannot seem to capture things like the function name or the parameters while I also have a greedy rule (ANTLR lets the greediest rule win over the others). So I cannot have a rule for parameters that checks they are valid identifiers while also having a '.+' rule to collect the function body.
A compromise would be to collect the function as a whole (everything between function and end function) and run a second parse on that block to extract the function header (name + params), but I am trying to avoid that.
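For reference, that two-pass compromise could look roughly like the Python sketch below. It assumes the first pass already delivered the whole function as one string; split_function, HEADER and END are hypothetical names, and the regular expressions are only a guess at what a valid header looks like in this language.

```python
import re

# Assumed header shape: 'function', a name, then a parenthesised parameter list.
HEADER = re.compile(r"\s*function\s+(\w+)\s*\(([^)]*)\)")
# Assumed terminator: 'end function' at the very end of the blob.
END = re.compile(r"end\s+function\s*$")


def split_function(blob: str):
    """Second pass: split a whole-function blob into (name, params, body)."""
    header = HEADER.match(blob)
    if header is None:
        raise ValueError("not a function declaration")
    end = END.search(blob, header.end())
    if end is None:
        raise ValueError("missing 'end function'")
    name = header.group(1)
    params = [p.strip() for p in header.group(2).split(",") if p.strip()]
    body = blob[header.end():end.start()].strip()
    return name, params, body
```

The body is returned verbatim apart from surrounding whitespace, so it stays a "script code blob".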
Another idea would be to have an additional mode, "FUNCTION_BODY_MODE", which the lexer enters once it encounters an R_PAREN and pops out of (twice) once it finds end function. This way, anything between R_PAREN and end function is the function's body, and inside that innermost mode I can have a greedy rule.
Something like:
FunctionStart: FUNCTION_START_FRAGMENT-> channel(SkipChannel), pushMode(IN_FUNCTION);
mode IN_FUNCTION;
FunctionBodyStart: R_PAREN_FRAGMENT -> channel(SkipChannel), pushMode(IN_FUNCTION_BODY);
mode IN_FUNCTION_BODY;
FunctionBodyAndFunctionEnd : FUNCTION_END_FRAGMENT -> channel(SkipChannel), popMode, popMode; // double pop
ALL_TEXT : . ; // will consume everything
My issue with the above is that it seems extremely counter-intuitive. I am very new to ANTLR parsers, so I am just looking for the best advice on an approach that fits my purposes.
Instead of pushing modes, I'd just use mode(...) to switch to another mode. This means you never need to pop modes, which makes it a bit easier to understand what's going on.
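To illustrate the difference, here is a small Python model of the bookkeeping involved. ModeTracker is a hypothetical class written for this explanation, not part of the ANTLR runtime; it only mirrors the idea that pushMode remembers the previous mode on a stack, popMode restores it, and mode simply overwrites the current mode.

```python
class ModeTracker:
    """Toy model of a lexer's current mode plus its mode stack."""

    def __init__(self):
        self.current = "DEFAULT_MODE"
        self.stack = []

    def mode(self, name):
        # Plain switch: nothing is remembered, so nothing has to be popped.
        self.current = name

    def push_mode(self, name):
        # Remember where we came from, then switch.
        self.stack.append(self.current)
        self.current = name

    def pop_mode(self):
        # Return to whatever mode was active before the matching push.
        self.current = self.stack.pop()
```

With pushes, leaving a function body means popping exactly as many times as you pushed (the "double pop" in the question). With mode(...), the rule that matches end function just names the mode to return to.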
I'd go for something like this:
lexer grammar ExprLexer;
ScriptStart : '<%' -> mode(Script);
GenericText : . ;
fragment Spaces : [ \r\n\t]+;
fragment Id : [a-zA-Z0-9_\u0080-\ufffe]+;
mode Script;
ScriptEnd : '%>' -> mode(DEFAULT_MODE);
Comment : '\'' ~[\r\n]* -> skip;
Function : 'function' -> mode(FunctionDeclaration);
ScriptText : . ;
mode FunctionDeclaration;
FunctionName : Id;
DeclarationSpaces : Spaces+ -> skip;
OPar : '(' -> mode(FunctionParameter);
mode FunctionParameter;
ParameterName : Id;
ParameterSpaces : Spaces+ -> skip;
Comma : ',';
CPar : ')' -> mode(InFunction);
mode InFunction;
EndFunction : 'end' Spaces 'function' -> mode(Script);
FunctionSpaces : Spaces+ -> skip;
FunctionText : . ;
parser grammar ExprParser;
options { tokenVocab=ExprLexer; }
file
: block* EOF
;
block
: plainText
| ScriptStart script* ScriptEnd
;
plainText
: GenericText+
;
script
: ScriptText+
| function
;
function
: Function FunctionName OPar parameters? CPar functionBody EndFunction
;
functionBody
: FunctionText*
;
parameters
: ParameterName ( Comma ParameterName )*
;
which will parse your input:
text1
<%
arbitrary script lines1
arbitrary script lines2
function x(a,b)
arbitrary script body containing anything
end function
arbitrary script lines3 again
%>
plain text
<%
arbitrary script lines4 again
function y()
different function body
end function
%>
MU
like this:
(file
(block
(plainText t e x t 1 \n))
(block <%
(script \n a r b i t r a r y s c r i p t l i n e s 1 \n a r b i t r a r y s c r i p t l i n e s 2 \n \n)
(script
(function function x ( (parameters a , b) )
(functionBody a r b i t r a r y s c r i p t b o d y c o n t a i n i n g a n y t h i n g) end function))
(script \n \n a r b i t r a r y s c r i p t l i n e s 3 a g a i n \n) %>)
(block
(plainText \n p l a i n t e x t \n))
(block <%
(script \n a r b i t r a r y s c r i p t l i n e s 4 a g a i n \n \n)
(script
(function function y ( )
(functionBody d i f f e r e n t f u n c t i o n b o d y) end function))
(script \n) %>)
(block
(plainText \n M U)) <EOF>)
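If it helps to see the mode switching outside ANTLR, below is a rough Python sketch of the same idea. It is a hand-written scanner, not something ANTLR generates, and it simplifies the grammar above: the FunctionDeclaration and FunctionParameter modes are collapsed into one regular expression, consecutive characters are joined into one blob, and blobs are stripped of surrounding whitespace (empty ones are dropped). The names scan, FUNC_HEADER, END_FUNCTION and COMMENT are made up for this example.

```python
import re

FUNC_HEADER = re.compile(r"function\s+(\w+)\s*\(([^)]*)\)")
END_FUNCTION = re.compile(r"end\s+function")
COMMENT = re.compile(r"'[^\r\n]*")  # comments run from a single quote to end of line


def scan(source):
    """Return a list of ('text', s), ('script', s) and ('function', name, params, body)."""
    nodes, buf = [], []
    mode, pos = "TEXT", 0
    current = None  # (name, params) of the function being read

    def flush(kind):
        text = "".join(buf).strip()
        buf.clear()
        if text:
            nodes.append((kind, text))

    while pos < len(source):
        if mode == "TEXT":
            if source.startswith("<%", pos):
                flush("text")
                mode, pos = "SCRIPT", pos + 2
                continue
        elif mode == "SCRIPT":
            if source.startswith("%>", pos):
                flush("script")
                mode, pos = "TEXT", pos + 2
                continue
            comment = COMMENT.match(source, pos)
            if comment:
                pos = comment.end()  # skip comments, as the Comment rule does
                continue
            header = FUNC_HEADER.match(source, pos)
            if header:
                flush("script")
                params = [p.strip() for p in header.group(2).split(",") if p.strip()]
                current = (header.group(1), params)
                mode, pos = "IN_FUNCTION", header.end()
                continue
        else:  # IN_FUNCTION: everything up to 'end function' is the body
            end = END_FUNCTION.match(source, pos)
            if end:
                body = "".join(buf).strip()
                buf.clear()
                nodes.append(("function", current[0], current[1], body))
                mode, pos = "SCRIPT", end.end()
                continue
        buf.append(source[pos])  # catch-all: keep the character verbatim
        pos += 1

    if mode != "TEXT":
        raise ValueError("input ended inside a script block")
    flush("text")
    return nodes
```

Like the grammar, this only strips comments in script mode, not inside a function body, so a comment containing end function would still close the function.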