Search code examples
parsingcompiler-construction

All viable prefixes of a Context Free Grammer


I am stuck to a problem from the famous dragon Book of Compiler Design.How to find all the viable prefixes of the following grammar:

S -> 0S1 | 01

The grammar is actually the language of the regex 0n1n. I presume the set of all viable prefixes might come as a regex too.I came up with the following solution

0+
0+S
0+S1
0+1
S

(By plus , I meant no of zeroes is 1..inf)

after reducing string 000111 with the following steps:

stack             input
                 000111
0                00111
00               0111 
000              111
0001             11
00S              11
00S1             1
0S               1
0S1              $
S                $

Is my solution correct or I am missing something?


Solution

  • 0n1n is not a regular language; regexen don't have variables like n and they cannot enforce an equal number of repetitions of two distinct subsequences. Nonetheless, for any context-free grammar, the set of viable prefixes is a regular language. (A proof of this fact, in some form, appears at the beginning of Part II of Donald Knuth's seminal 1965 paper, On the Translation of Languages from Left to Right, which demonstrated both a test for the LR(k) property and an algorithm for parsing LR(k) grammars in linear time.)

    OK, to the actual question. A viable prefix for a grammar is (by definition) the prefix of a sentential form which can appear on the stack during a parse using that grammar. It's called "viable" (which means "still alive" or "could continue growing") precisely because it must be the prefix of some right sentential form whose suffix contains no non-terminal symbol. In other words, there exists a sequence of terminals which can be appended to the viable prefix in order to produce a right-sentential form; the viable prefix can grow.

    Knuth shows how to create a DFA which produces all viable prefixes, but it's easier to see this DFA if we already have the LR(k) parser produced by an LR(k) algorithm. That parser is a finite-state machine whose alphabet is the set of terminal and non-terminal symbols of a grammar. To get the viable-prefix grammar, we use exactly the same state machine, but we remove the stack (so that it becomes just a state machine) and the reduce actions, leaving only the shift and goto actions as transitions. All states in the viable-prefix machine are accepting states, since any prefix of a viable prefix is itself a viable prefix.

    A key feature of this new automaton is that it cannot extend a prefix with a reduce action (since we removed all the reduce actions). A prefix with a reduce action is a prefix which ends in a handle -- recall that a handle is the right-hand side of some production -- so another definition of a viable prefix is that it is a right-sentential form (that is, a possible step in a derivation) which does not extend beyond the right-most handle.

    The grammar you are working with has only two productions, so there are only two handles, 01 and 0S1. Note that 10 and 1S cannot be subsequences of any right-sentential form, nor can a right-sentential form contain more than one S. Any right-sentential form must either be a sentence 0n1n or a sentential form 0nS1n where n>0. But every handle ends at the first 1 of a sentential form, and so a viable prefix must end at or before the first 1. This produces precisely the four possibilities you list, which we can condense to the regular expression 0*0(S1?)?.

    Chopping off the suffix removed the second n from the formula, so there is no longer a requirement of concordance and the language is regular.


    Note:

    Questions like this and their answers are begging to be rendered using MathJax. StackOverflow, unfortunately, does not provide this extension, which is apparently considered unnecessary for programming. However, there is a site in the StackExchange constellation dedicated to computing science questions, http://cs.stackexchange.com, and another one dedicated to mathematical questions, http://math.stackexchange.com. Formal language theory is part of both computing science and mathematics. Both of those sites permit MathJax, and questions on those sites will not be closed because they are not programming questions. I suggest you take this information into account for questions like this one.