Search code examples
parsingstack-overflowantlrantlr4

Antlr parser StackOverflowException (for parsing regular expressions)


I made a simple grammar for parsing regular expressions. Unfortunately, when I try to test my regex compiler on large expressions I reach StackOverflowException. The problem is similar to this one except that their solution no longer works in my scenario. Here is my grammar:

union: concat | concat '|' union ;
concat: kleene_closure concat | kleene_closure;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;

Now the problem is that I have a really large file that looks like

something1 | something2 | something3 | .... | something1000

I use ANTLR's Visitor class for parsing. I know I could probably make some optimization by using +/* like this

union: (concat '|')* concat ;
concat: kleene_closure+;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;

However, it doesn't really solve the problem, due to recursive nature of this grammar. For instance, it would now fail on the following sample that clearly requires recursion:

(...(((something1) | something2) | something3) | .... ) | something1000

How can I avoid StackOverflowExcpetion? How do other compilers, like for instance C compiler deal with really large texts that have thousands lines of code?


Solution

  • If you're going to use a recursive descent parser, then you will inevitably run into an input which exceeds the call stack depth. This problem is ameliorated by languages like Java which are capable of controlling their own stack depth, so that there is a controllable result like a StackOverflowException. But it's still a real problem.

    Parser generators like Yacc/Bison and Java Cup use a bottom-up LALR(1) algorithm which uses an explicit stack for temporary storage, rather than using the call stack for that purpose. That means that the parsers have to manage storage for the parser stack (or use a container ADT from the host language's standard library, if there is one), which is slightly more complex. But you don't have to deal with that complexity; it's built in to the parser generator.

    There are several advantages of the explicit stack for the parser generator:

    • It's easier to control maximum stack size;
    • The maximum stack size is (usually) only limited by available memory;
    • It's probably more memory efficient because control flow information doesn't need to be kept in stack frames.

    Still, it's not a panacea. A sufficiently complicated expression will exceed any fixed stack size, and that can be lead to certain programs being unparseable. Furthermore, if you take advantage of the flexibility mentioned in the second point above ("only limited by available memory"), you may well find that your compiler is terminated unceremoniously by an OOM process (or a segfault) rather than being able to respond to a more polite out-of-memory exception (depending on OS and configuration, of course).

    As to:

    How do other compilers, like for instance C compiler deal with really large texts that have thousands lines of code?

    Having thousands of lines of code is not a problem if you use a repetition operator in your grammar (or, in the case that you are using an LALR(1) parser, that your grammar is left-recursive). The problem arises, as you note in your question, when you have texts with thousands of nested blocks. And the answer is that many C compilers don't deal gracefully with such texts. Here's a simple experiment with gcc:

    $ # A function which generates deeply-nested C programs
    $ type deep
    deep is a function
    deep () { 
        n=$1;
        printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
        for ((i=0; i<n; ++i))
        do
            printf '%*s{ int a%d = a%d + 1;\n' $((i+1)) '' $((i+1)) $i;
        done;
        printf '%*sprintf("%%d\\n", a%d);\n' $n '' $n;
        for ((i=0; i<n; ++i))
        do
            printf "%s" '}';
        done;
        printf "%s\n" '}'
    }
    $ deep 3
    #include <stdio.h>
    int main(void) {
     int a0 = 0;
     { int a1 = a0 + 1;
      { int a2 = a1 + 1;
       { int a3 = a2 + 1;
       printf("%d\n", a3);
    }}}}
    $ # For small depths, GCC is OK with that.
    $ deep 3 | gcc -x c - && ./a.out
    3
    $ # Let's go deeper:
    $ deep 10 | gcc -x c - && ./a.out
    10
    $ deep 100 | gcc -x c - && ./a.out
    100
    $ deep 1000 | gcc -x c - && ./a.out
    1000
    $ deep 10000 | gcc -x c - && ./a.out
    10000
    $ # Ka-bang. (Took quite a long time, too.)
    $ deep 100000 | gcc -x c - && ./a.out
    gcc: internal compiler error: Segmentation fault (program cc1)
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
    

    Without the nested blocks, gcc is still slow but can handle the program:

    $ type big
    big is a function
    big () 
    { 
        n=$1;
        printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
        for ((i=0; i<n; ++i))
        do
            printf ' int a%d = a%d + 1;\n' $((i+1)) $i;
        done;
        printf ' printf("%%d\\n", a%d);\n' $n;
        printf "%s\n" '}'
    }
    $ big 3
    #include <stdio.h>
    int main(void) {
     int a0 = 0;
     int a1 = a0 + 1;
     int a2 = a1 + 1;
     int a3 = a2 + 1;
     printf("%d\n", a3);
    }
    $ $ big 3|gcc -x c - && ./a.out
    3
    $ big 10000|gcc -x c - && ./a.out
    10000
    $ big 100000|gcc -x c - && ./a.out
    100000