For a while I've been trying to learn how LL parsers work and, if I understand this correctly, when writing a top-down recursive descent parser by hand I should create a function for every non-terminal symbol. So for this example:
S -> AB
A -> aA|ε
B -> bg|dDe
D -> ab|cd
I'd have to create a function for each S, A, B and D like this:
Token B()
{
if (Lookahead(1) == 'b')
{
Eat('b');
Eat('g');
}
else if (Lookahead(1) == 'd')
{
Eat('d');
D();
Eat('e');
}
else
{
Error();
}
return B_TOKEN;
}
But then I try to do the same thing with the following grammar that I created to generate the same language as (a|b|c)*a regular expression:
S -> Ma
M -> aM|bM|cM|ε
That gives me following functions:
Token S()
{
char Ch = Lookahead(1);
if (Ch == 'a' || Ch == 'b' || Ch == 'c')
{
M();
Eat('a');
}
else
{
Error();
}
return S_TOKEN;
}
Token M()
{
char Ch = Lookahead(1);
if (Ch == 'a' || Ch == 'b' || Ch == 'c')
{
Eat(ch);
M();
}
return M_TOKEN;
}
And it doesn't seem to be good because for input 'bbcaa' M will consume everything, and after that S won't find the last 'a' and report an error, even though the grammar accepts it. It feels like M is missing the ε case, or maybe it is handled in the wrong way, but I'm not sure how to deal with it. Could anyone please help?
The behaviour of a top-down predictive parser is exactly as you note in your question. In other words, your second grammar is not suitable for top-down parsing (with one-token lookahead). Many grammars are not; that includes any grammar in which it is not possible to predict which production to use based on a finite lookahead.
In this case, if you were to lookahead two tokens instead of one, you could solve the conflict; M
should predict the ε
production on lookahead aEND, and the aM
production on all other two-token lookaheads where the first token is a.