I've just started writing a simple parser in C# that you can find here: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L1078
The main feature being that I wanted to be able to write a JSON configuration for some grammar that I want to analyze, load up the configuration at runtime and evaluate some string input via the lexer/parser and generate a tree with information about each node for things like syntax highlighting and extracting variables from sections of a parsed text, rather than using a third party tool like a yacc/bison clone to create code to be compiled from a grammar, for some languages like SQL/CSS/HTML.
The simple question is, what are some other good methods of detecting recursion problems early from malformed input before a StackOverflowException? The details are below.
I don't have grammars fully built out yet for various languages, just simple tests of some basic grammar so far, but I ran across a problem when trying to implement EBNF like: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form when having to deal with how the right-hand side of grammar statements are evaluated like so:
rhs = identifier
| terminal
| "[" , rhs , "]"
| "{" , rhs , "}"
| "(" , rhs , ")"
| rhs , "|" , rhs
| rhs , "," , rhs ;
So in my JSON configuration it is set up like the following: https://github.com/JohnnyErnest/LexerParser/blob/main/Configuration/ParserEBNF.json#L178 accounting also for white-spaces.
"rhsBlock1": {
"sequence": [
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "bracketOpen" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "bracketClose" },
{
"tokenList": "whitespaces",
"isOptional": "true"
}
]
},
"rhsBlock2": {
"sequence": [
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "braceOpen" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "braceClose" },
{
"tokenList": "whitespaces",
"isOptional": "true"
}
]
},
"rhsBlock3": {
"sequence": [
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "parenthesisOpen" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "parenthesisClose" },
{
"tokenList": "whitespaces",
"isOptional": "true"
}
]
},
"rhsBlockPipe": {
"sequence": [
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "pipe" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
}
]
},
"rhsBlockComma": {
"sequence": [
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "tokenList": "comma" },
{
"tokenList": "whitespaces",
"isOptional": "true"
},
{ "sequenceList": "rhs" },
{
"tokenList": "whitespaces",
"isOptional": "true"
}
]
},
"rhs": {
"sequence": [
{ "sequenceList": "identifier,ebnfTerminal,rhsBlock1,rhsBlock2,rhsBlock3,rhsBlockPipe,rhsBlockComma" }
]
},
Given that definition, the following works and evaluates true:
string inputEBNF = "a=\"5\";";
var ebnfParserResult = ebnfParser.Parse(inputEBNF, "grammar");
However, if you changed the "5" to a literal though, which is invalid EBNF, you won't get an error, you'll get infinite recursion.
string inputEBNF = "a=5;";
var ebnfParserResult = ebnfParser.Parse(inputEBNF, "grammar");
var exit = ebnfParser.CancelAllParsing;
The value of exit will be true, because the parser will go up to MaxLevel in recursion and I shut it off at that point, otherwise it would keep on going until a StackOverflowException.
It's when earlier blocks in the JSON config like rhsBlockComma call later blocks rhs that the problem starts.
In my code you'll see that Parser.Parse calls a method called Check: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L1017
Check will then call CheckSequence: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L954 which will check each section of a sequence, calling CheckSequenceSection https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L843
CheckSequenceSection looks through each lexer token in a JSON section's tokenList first, and then looks through each sequence in the section's sequenceList, and call CheckSequence on each sequence, which does a single recursion.
Normally that works fine if the input is valid. If not, a variable level is tracked and if it goes beyond MaxLevel then aborting occurs. I also needed to add a preemptive return before CheckSequence from CheckSequenceSection or else it would just keep doing recursions until a StackOverflowException.
What are some other methods to early detect recursion problems from malformed input?
Answer is explained better here: https://stackoverflow.com/a/20179940/5556724
The version of EBNF on Wikipedia is written like so:
rhs = identifier
| terminal
| "[" , rhs , "]"
| "{" , rhs , "}"
| "(" , rhs , ")"
| rhs , "|" , rhs
| rhs , "," , rhs ;
but should be written like so:
primary = identifier
| terminal
| "[" , rhs , "]"
| "{" , rhs , "}"
| "(" , rhs , ")"
;
factor = primary , { "|" , primary } ;
rhs = factor , { "," , factor } ;
otherwise there is a FIRST/FIRST conflict.