Search code examples
c#parsinglex

C# writing a first parser, early stackoverflow detection


I've just started writing a simple parser in C# that you can find here: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L1078

The main feature being that I wanted to be able to write a JSON configuration for some grammar that I want to analyze, load up the configuration at runtime and evaluate some string input via the lexer/parser and generate a tree with information about each node for things like syntax highlighting and extracting variables from sections of a parsed text, rather than using a third party tool like a yacc/bison clone to create code to be compiled from a grammar, for some languages like SQL/CSS/HTML.

The simple question is, what are some other good methods of detecting recursion problems early from malformed input before a StackOverflowException? The details are below.

I don't have grammars fully built out yet for various languages, just simple tests of some basic grammar so far, but I ran across a problem when trying to implement EBNF like: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form when having to deal with how the right-hand side of grammar statements are evaluated like so:

rhs = identifier
     | terminal
     | "[" , rhs , "]"
     | "{" , rhs , "}"
     | "(" , rhs , ")"
     | rhs , "|" , rhs
     | rhs , "," , rhs ;

So in my JSON configuration it is set up like the following: https://github.com/JohnnyErnest/LexerParser/blob/main/Configuration/ParserEBNF.json#L178 accounting also for white-spaces.

"rhsBlock1": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "bracketOpen" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "bracketClose" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlock2": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "braceOpen" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "braceClose" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlock3": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "parenthesisOpen" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "parenthesisClose" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlockPipe": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "pipe" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlockComma": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "comma" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhs": {
  "sequence": [
    { "sequenceList": "identifier,ebnfTerminal,rhsBlock1,rhsBlock2,rhsBlock3,rhsBlockPipe,rhsBlockComma" }
  ]
},

Given that definition, the following works and evaluates true:

        string inputEBNF = "a=\"5\";";
        var ebnfParserResult = ebnfParser.Parse(inputEBNF, "grammar");

However, if you changed the "5" to a literal though, which is invalid EBNF, you won't get an error, you'll get infinite recursion.

        string inputEBNF = "a=5;";
        var ebnfParserResult = ebnfParser.Parse(inputEBNF, "grammar");
        var exit = ebnfParser.CancelAllParsing;

The value of exit will be true, because the parser will go up to MaxLevel in recursion and I shut it off at that point, otherwise it would keep on going until a StackOverflowException.

It's when earlier blocks in the JSON config like rhsBlockComma call later blocks rhs that the problem starts.

In my code you'll see that Parser.Parse calls a method called Check: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L1017

Check will then call CheckSequence: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L954 which will check each section of a sequence, calling CheckSequenceSection https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L843

CheckSequenceSection looks through each lexer token in a JSON section's tokenList first, and then looks through each sequence in the section's sequenceList, and call CheckSequence on each sequence, which does a single recursion.

Normally that works fine if the input is valid. If not, a variable level is tracked and if it goes beyond MaxLevel then aborting occurs. I also needed to add a preemptive return before CheckSequence from CheckSequenceSection or else it would just keep doing recursions until a StackOverflowException.

What are some other methods to early detect recursion problems from malformed input?


Solution

  • Answer is explained better here: https://stackoverflow.com/a/20179940/5556724

    The version of EBNF on Wikipedia is written like so:

    rhs = identifier
         | terminal
         | "[" , rhs , "]"
         | "{" , rhs , "}"
         | "(" , rhs , ")"
         | rhs , "|" , rhs
         | rhs , "," , rhs ;
    

    but should be written like so:

    primary = identifier
            | terminal
            | "[" , rhs , "]"
            | "{" , rhs , "}"
            | "(" , rhs , ")"
            ;
    factor  = primary , { "|" , primary } ;
    rhs     = factor ,  { "," , factor } ;
    

    otherwise there is a FIRST/FIRST conflict.