Search code examples
parsinggrammartopdown

Grammar and top down parsing


For a while I've been trying to learn how LL parsers work and, if I understand this correctly, when writing a top-down recursive descent parser by hand I should create a function for every non-terminal symbol. So for this example:

S -> AB
A -> aA|ε
B -> bg|dDe
D -> ab|cd

I'd have to create a function for each S, A, B and D like this:

Token B()
{
    if (Lookahead(1) == 'b')
    {
        Eat('b');
        Eat('g');
    }
    else if (Lookahead(1) == 'd')
    {
        Eat('d');
        D();
        Eat('e');
    }
    else
    {
        Error();
    }

    return B_TOKEN;
}

But then I try to do the same thing with the following grammar that I created to generate the same language as (a|b|c)*a regular expression:

S -> Ma
M -> aM|bM|cM|ε

That gives me following functions:

Token S()
{
    char Ch = Lookahead(1);
    if (Ch == 'a' || Ch == 'b' || Ch == 'c')
    {
        M();
        Eat('a');
    }
    else
    {
        Error();
    }

    return S_TOKEN;
}

Token M()
{
    char Ch = Lookahead(1);
    if (Ch == 'a' || Ch == 'b' || Ch == 'c')
    {
        Eat(ch);
        M();
    }

    return M_TOKEN;
}

And it doesn't seem to be good because for input 'bbcaa' M will consume everything, and after that S won't find the last 'a' and report an error, even though the grammar accepts it. It feels like M is missing the ε case, or maybe it is handled in the wrong way, but I'm not sure how to deal with it. Could anyone please help?


Solution

  • The behaviour of a top-down predictive parser is exactly as you note in your question. In other words, your second grammar is not suitable for top-down parsing (with one-token lookahead). Many grammars are not; that includes any grammar in which it is not possible to predict which production to use based on a finite lookahead.

    In this case, if you were to lookahead two tokens instead of one, you could solve the conflict; M should predict the ε production on lookahead aEND, and the aM production on all other two-token lookaheads where the first token is a.