This is a follow-up to this question, which Bart answered perfectly.
My goal is to get individual lines for either "generic script lines" or "lines inside a function body", ideally with surrounding whitespace discarded, while still getting everything outside the <% and %> tags in bulk. I came up with a solution, but the resulting tree looks messy.
Here is my lexer:
lexer grammar CmScriptLexer;
//Whitespace: Spaces -> channel(HIDDEN);
ScriptStart : '<%' (Spaces)* -> mode(Script);
SpacesPlain : [\r\n]+ -> skip;
GenericText : . ;
mode Script;
ScriptEnd : '%>' -> mode(DEFAULT_MODE);
Comment : '\'' ~[\r\n]* -> skip;
Function : 'function' -> mode(FunctionDeclaration);
NL : [\r\n]+;
ScriptText : . ;
mode FunctionDeclaration;
FunctionComment : '\'' ~[\r\n]* -> skip;
FunctionName : Id;
DeclarationSpaces : Spaces+ -> skip;
OPar : '(' -> mode(FunctionParameter);
mode FunctionParameter;
FunctionParameterComment : '\'' ~[\r\n]* -> skip;
ParameterName : Id;
ParameterSpaces : Spaces+ -> skip;
Comma : ',';
CPar : ')' -> mode(InFunction);
mode InFunction;
FunctionBodyComment : '\'' ~[\r\n]* -> skip;
EndFunction : 'end' Spaces 'function' -> mode(Script);
FunctionLine : ~[ \r\n]+;
FunctionSpaces : Spaces+;
//FunctionText : . ;
fragment Spaces : [ \r\n\t]+;
fragment Id : [a-zA-Z0-9_\u0080-\ufffe]+;
and my parser:
parser grammar CmScriptParser;
options { tokenVocab=CmScriptLexer; }
file
: block* EOF
;
block
: plainText
| ScriptStart script* ScriptEnd
;
plainText
: GenericText+ NL*
;
script
: simpleScript NL*
| function NL*
;
simpleScript
: ScriptText+
;
function
: Function FunctionName OPar parameters? CPar functionBody EndFunction
;
functionBody
: functionLines+
;
functionLines
: FunctionSpaces* functionLine FunctionSpaces*
;
functionLine
: FunctionLine+
;
parameters
: ParameterName ( Comma ParameterName )*
;
and finally what I'm using as a test case:
foo
bar
<%
line 1
line 2
function x(y)
spanning
multiple
lines
end function
function a(b) no newlines end function
%>
baz
My issue is that it seems really verbose, and I fear that my "solution", while it works with the test case, is poorly laid out and that I may be overthinking the rules.
Any suggestions on how to improve it? All I want is trimmed "line" elements, so that matching something like \n \n\n\tscript line \n\n\t\n ideally results in a line containing just script line.
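To make that requirement concrete, here is a minimal Python sketch of the trimming, done on a plain string outside ANTLR. The helper name `trimmed_lines` is made up for illustration and is not part of the grammar or of any library.

```python
def trimmed_lines(text):
    """Split on line breaks, strip each line, and drop lines that end up empty."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# The whitespace-padded input from the question collapses to one trimmed line.
print(trimmed_lines("\n \n\n\tscript line \n\n\t\n"))  # ['script line']
```

This only states the desired result; it says nothing about how the lexer and parser should produce it.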
EDIT: here is an example of what I think I am after; again, I may not be expressing it in the best possible way:
simpleScript:
  scriptLine: line 1
  scriptLine: line 2
function:
  name: x
  parameters:
    parameter: y
  body:
    functionLine: spanning
    functionLine: multiple
    functionLine: lines
function:
  name: a
  parameters:
    parameter: b
  body:
    functionLine: no newlines
The end goal is that, when walking the tree, I can create objects such as a new "function call object" and make calls like:
script = new Script() // on script "enter"
script.addLine("line 1")
script.addLine("line 2")
program.addNode(script) // on script "exit"
...
function = new Function() // on function "enter"
function.setName("x") // on "function"?
...
function.addParameter("y") // on "parameter"
...
function.addBodyLine("spanning") // on "line" ??
function.addBodyLine("multiple")
function.addBodyLine("lines")
...
program.addFunctionDeclaration(function) // on function "exit" once complete
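As a rough illustration of that end goal, the following Python sketch models the objects and the enter/exit style callbacks. Every class and method name here (`Script`, `Function`, `Program`, `Builder` and its hooks) is an assumption made up for this example, and the callbacks are invoked by hand rather than by a generated ANTLR listener.

```python
class Script:
    """A run of plain script lines outside any function."""
    def __init__(self):
        self.lines = []


class Function:
    """A function declaration: name, parameter names, and body lines."""
    def __init__(self, name):
        self.name = name
        self.parameters = []
        self.body = []


class Program:
    """Top-level container for script nodes and function declarations."""
    def __init__(self):
        self.nodes = []
        self.functions = []


class Builder:
    """Collects nodes from enter/exit style callbacks (hypothetical hook names)."""
    def __init__(self):
        self.program = Program()
        self.current = None

    def enter_script(self):
        self.current = Script()

    def script_line(self, text):
        self.current.lines.append(text)

    def exit_script(self):
        self.program.nodes.append(self.current)
        self.current = None

    def enter_function(self, name):
        self.current = Function(name)

    def parameter(self, name):
        self.current.parameters.append(name)

    def body_line(self, text):
        self.current.body.append(text)

    def exit_function(self):
        self.program.functions.append(self.current)
        self.current = None


# Replay the first script block and the first function of the test case by hand.
builder = Builder()
builder.enter_script()
builder.script_line("line 1")
builder.script_line("line 2")
builder.exit_script()
builder.enter_function("x")
builder.parameter("y")
for line in ("spanning", "multiple", "lines"):
    builder.body_line(line)
builder.exit_function()

print(builder.program.nodes[0].lines)        # ['line 1', 'line 2']
print(builder.program.functions[0].body)     # ['spanning', 'multiple', 'lines']
```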
The problem is that inside a script, you cannot simply tell the lexer to match some non-space character followed by everything except line breaks. Sure, that would match line 1, but it would also match function x(y), because the lexer matches greedily (it tries to consume as many characters as possible). You must therefore chop up the tokens on whitespace.
You could merge some single-char tokens using ~[ \t\r\n]+, but you cannot create tokens that match multiple words with spaces in between as a single token.
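The greedy behaviour can be imitated with a small Python sketch. It is only a simplified model of the "longest match wins, earliest rule breaks ties" selection, written with the `re` module rather than ANTLR; the rule name `ScriptLine` and the helper `longest_match` are hypothetical.

```python
import re


def longest_match(rules, text, pos=0):
    """Return (rule name, lexeme) for the longest match at pos; ties go to the earliest rule."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# A whole-line rule: some non-space followed by everything except line breaks.
line_rules = [
    ("Function", r"function"),
    ("ScriptLine", r"[^ \t\r\n][^\r\n]*"),
]
# Word-sized tokens, as in ~[ \t\r\n]+
word_rules = [
    ("Function", r"function"),
    ("ScriptText", r"[^ \t\r\n]+"),
]

print(longest_match(line_rules, "line 1"))         # ('ScriptLine', 'line 1')
print(longest_match(line_rules, "function x(y)"))  # ('ScriptLine', 'function x(y)')
print(longest_match(word_rules, "function x(y)"))  # ('Function', 'function')
```

With the whole-line rule, the keyword never gets a chance because the line rule consumes more characters. With word-sized tokens both rules match the same eight characters, so the rule listed first is chosen.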
Something like this lexer:
lexer grammar CmScriptLexer;
ScriptStart : '<%' Spaces* -> mode(Script);
GenericText : ~[ \t\r\n]+;
TextSpaces : Spaces -> skip;
mode Script;
ScriptEnd : '%>' -> mode(DEFAULT_MODE);
Comment : '\'' ~[\r\n]* -> skip;
Function : 'function' -> mode(FunctionDeclaration);
NL : [\r\n]+;
ScriptText : ~[ \t\r\n]+;
ScriptSpaces : Spaces -> skip;
mode FunctionDeclaration;
FunctionComment : '\'' ~[\r\n]* -> skip;
FunctionName : Id;
DeclarationSpaces : Spaces+ -> skip;
OPar : '(' -> mode(FunctionParameter);
mode FunctionParameter;
FunctionParameterComment : '\'' ~[\r\n]* -> skip;
ParameterName : Id;
ParameterSpaces : Spaces+ -> skip;
Comma : ',';
CPar : ')' -> mode(InFunction);
mode InFunction;
FunctionBodyComment : '\'' ~[\r\n]* -> skip;
EndFunction : 'end' Spaces 'function' -> mode(Script);
FunctionLine : ~[ \t\r\n]+;
FunctionSpaces : Spaces+ -> skip;
fragment Spaces : [ \r\n\t]+;
fragment Id : [a-zA-Z0-9_\u0080-\ufffe]+;
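Since the tokens are now chopped up on whitespace, the trimmed lines have to be reassembled after lexing, for example in a listener. Here is a minimal Python sketch of that step, assuming the words arrive as `(type, text)` pairs and that an `NL` token still separates the lines; the pair format and the helper name `group_lines` are stand-ins for illustration, not the ANTLR runtime API.

```python
def group_lines(tokens):
    """Join consecutive word tokens with single spaces; an NL token ends the current line."""
    lines, words = [], []
    for kind, text in tokens:
        if kind == "NL":
            if words:
                lines.append(" ".join(words))
                words = []
        else:
            words.append(text)
    if words:  # flush a final line that has no trailing NL
        lines.append(" ".join(words))
    return lines


# Hand-written stand-in for the tokens of the two plain script lines in the test case.
tokens = [
    ("ScriptText", "line"), ("ScriptText", "1"), ("NL", "\n"),
    ("ScriptText", "line"), ("ScriptText", "2"), ("NL", "\n"),
]
print(group_lines(tokens))  # ['line 1', 'line 2']
```

Because whitespace is skipped by the lexer, every reassembled line comes out trimmed, with runs of spaces between words collapsed to one.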