c recursion compiler-construction stack-overflow buffer-overflow

How to prevent stack overflow when dealing with long recursive productions in C?

Given a grammar, how can one avoid stack overflow problem when calculating FIRST and FOLLOW sets in C. The problem arose in my code when I had to recurse through a long production.

Example:

S->ABCD
A->aBc | epsilon
B->Bc
C->a | epsilon
D->B

That is just a grammar off-head. The recursion is as such:

S->A
C->A
A->B
B->D
D->aBc | epsilon

FIRST(S)=FIRST(A)=FIRST(B)=FIRST(D)={a,epsilon}.  
Provide a C (not C++) code that calculates and print FIRST and FOLLOW set of the grammar above keeping in mind that you might encounter a longer grammar that has multiple implicit first/follow sets of a particular non-terminal.

For example:

FIRST(A)=FIRST(B)=FIRST(B)=FIRST(C)=FIRST(D)=FIRST(E)=FIRST(F)=FIRST(G)=FIRST(H)=FIRST(I)=FIRST(J)=FIRST(K)={k,l,epsilon}.

That is: for you to get FIRST(A) you have to calculate FIRST(B) and so on until you get to FIRST(K) that has its FIRST(K) has terminals 'k','l', and epsilon. The longer the implication, the more likely you will encounter stack-overflow due to multiple recursion.
How can this be avoided in C language and yet still get the correct output?
Explain with a C (not C++) code.

char*first(int i)
{
    int j,k=0,x;
    char temp[500], *str;
    for(j=0;grammar[i][j]!=NULL;j++)
    {
        if(islower(grammar[i][j][0]) || grammar[i][j][0]=='#' || grammar[i][j][0]==' ')
        {
           temp[k]=grammar[i][j][0];
           temp[k+1]='\0';
        }
        else
        {
            if(grammar[i][j][0]==terminals[i])
            {
                temp[k]=' ';
                temp[k+1]='\0';
            }
            else
            {
                x=hashValue(grammar[i][j][0]);
                str=first(x);
                strncat(temp,str,strlen(str));
            }
        }
        k++;
    }
    return temp;
}

My code goes to stack overflow. How can I avoid it?

Solution

Your program is overflowing the stack not because the grammar is "too complex" but rather because it is left-recursive. Since your program does not check to see if it has already recursed through a non-terminal, once it tries to compute first('B'), it will enter an infinite recursion, which will eventually fill the call stack. (In the example grammar, not only is B left-recursive, it is also useless because it has no non-recursive production, which means that it can never derive a sentence consisting only of terminals.)

That's not the only problem, though. The program suffers from at least two other flaws:

It does not check if a given terminal has already been added to the FIRST set for a non-terminal before adding the terminal to the set. Consequently, there will be repeated terminals in the FIRST sets.
The program only checks the first symbol in the right-hand side. However, if a non-terminal can produce ε (in other words, the non-terminal is nullable), the following symbol needs to be used as well to compute the FIRST set.

For example,
```
A → B C d
B → b | ε
C → c | ε
```
Here, FIRST(A) is {b, c, d}. (And similarly, FOLLOW(B) is {c, d}.)

Recursion doesn't help much with the computation of FIRST and FOLLOW sets. The simplest algorithm to describe is the this one, similar to the algorithm presented in the Dragon Book, which will suffice for any practical grammar:

For each non-terminal, compute whether it is nullable.
Using the above, initialize FIRST(N) for each non-terminal N to the set of leading symbols for each production for N. A symbol is a leading symbol for a production if it is either the first symbol in the right-hand side or if every symbol to its left is nullable. (These sets will contain both terminals and non-terminals; don't worry about that for now.)
Do the following until no FIRST set is changed during the loop:
- For each non-terminal N, for each non-terminal M in FIRST(N), add every element in FIRST(M) to FIRST(N) (unless, of course, it is already present).
Remove all the non-terminals from all the FIRST sets.

The above assumes that you have an algorithm for computing nullability. You'll find that algorithm in the Dragon Book as well; it is somewhat similar. Also, you should eliminate useless productions; the algorithm to detect them is very similar to the nullability algorithm.

There is an algorithm which is usually faster, and actually not much more complicated. Once you've completed step 1 of the above algorithm, you have computed the relation leads-with(N, V), which is true if and only if some production for the nonterminal N starts with the terminal or non-terminal V, possibly skipping over nullable non-terminals. FIRST(N) is then the transitive closure of leads-with with its domain restricted to terminals. That can be efficiently computed (without recursion) using the Floyd-Warshall algorithm, or using a variant of Tarjan's algorithm for computing strongly connected components of a graph. (See, for example, Esko Nuutila's transitive closure page.)