Search code examples
parsingcontext-free-grammarll-grammar

Purpose of FIRST and FOLLOW sets in LL(1) parsers?


Can anyone explain to me how FIRST and FOLLOW should be used in LL(1) grammar? I understand that they are used for syntax table construction, but I don't understand how.


Solution

  • In an LL(1) parser, the parser works by maintaining a workspace initially seeded to the start symbol followed by the end-of-string marker (usually denoted $). At each step, it does one of the following:

    • If the first symbol of the workspace is a terminal, it matches it against the next token of input (or reports an error if it doesn't match.)
    • If the first symbol of the workspace is a nonterminal, it predicts what production to replace that nonterminal with.

    The predict step is where FIRST and FOLLOW show up. The parser needs to be able to guess, based purely on the current nonterminal and the next token of input, which production to use. The question is how to do this.

    Let's suppose that the current nonterminal is A and the next token of input is t. If you know the productions of A, which one would you choose to apply? There's one simple case to consider: if there's a production of the form A → tω, where ω is some arbitrary string, then you should pick that production because the t you're looking at as input will match the t at the front of the production.

    There are also some complex cases to consider. Suppose you have a production of the form A → Bω, where B is a nonterminal and ω is some string. Under what circumstances would you want to guess this production? Well, if you know that the next terminal symbol is a t, you wouldn't want to guess this production unless you knew that B can expand to a string that starts with the terminal t (there's another case that we'll talk about in a second). This is where FIRST sets come in. In grammars without ε productions, the set FIRST(X) for some nonterminal X is the set of all terminals that can potentially appear at the start of some string derived from X. If you have a production of the form A → Bω and you see the nonterminal t, you'd guess to use that production precisely when t ∈ FIRST(B); that is, B can derive some string that starts with t. If B doesn't derive anything starting with t, then there's no reason to choose it, and if B does derive something starting with t, you'd want to make this choice so that you could eventually match the t against it.

    Things get a bit trickier when ε productions are introduced. Now, let's suppose that you have a production of the form A → BCω, where B and C are nonterminals and ω is a string. Let's also suppose the next token of input is t. If t ∈ FIRST(B), then we'd choose this production, as mentioned above. However, what happens if t ∉ FIRST(B)? If there are ε productions in the grammar, we might still want to choose this production if B can derive ε and t ∈ FIRST(C). Why is this? If this happens, it means that we might be able to match the t by producing BCω, then producing ε from B, leaving Cω against which to match the t. This is one context where we might have to "look through" a nonterminal. Fortunately, this is handled by FIRST sets. If a nonterminal X can produce ε, then ε ∈ FIRST(X). Therefore, we can use FIRST sets to check whether we need to "look through" a nonterminal by seeing whether ε ∈ FIRST(X).

    So far we haven't talked about FOLLOW sets. Where do they come in? Well, suppose that we're processing the nonterminal A, we see the terminal t, but none of the productions for A can actually consume the t. What do we do then? It turns out there's still a way that we can eat up that t. Remember that LL(1) parsers work by maintaining a workspace with a string in it. It's possible that the t we're looking at is not supposed to be matched against the current nonterminal A at all, and instead we're supposed to have A produce ε and then let some later nonterminal in the workspace match against the t. This is where FOLLOW sets come in. The FOLLOW set of a nonterminal X, denoted FOLLOW(X), is the set of all terminal symbols that can appear immediately after X in some derivation. When choosing which production to choose for A, we add in a final rule - if the terminal symbol t is in the FOLLOW set of A, we choose some production that ultimately will produce ε. That way, the A will "disappear" and we can match the t against some character that appears after the A nonterminal.

    This isn't a complete introduction to LL(1) parsing, but I hope it helps you see why we need FIRST and FOLLOW sets. For more information, pick up a book on parsing (I recommend Parsing Techniques: A Practical Guide by Grune and Jacobs) or take a course on compilers. As a totally shameless plug, I taught a compilers course in Summer 2012-2013 and all of the lecture slides are available online.

    Hope this helps!