Search code examples
context-free-grammarregular-languagecontext-free-languagechomsky-normal-formchomsky-hierarchy

How to convert to Chomsky Normal Form quickly?


So I know the basic 4 step way of doing it. Removing the epsilons, then the variables less than 2 and so on. However, that way takes way too long for the problems we will have to do on the test.

Here is an example:
Convert this context-free grammar to an equivalent Chomsky normal form grammar. Remember to remove all useless symbols from the grammar.

S → TaXU | STUVWXY
T → UU | abc
U → bSc | ε
V → aV | Wb
W → cW | Va
X → bT | Tc
Y → cba | ε

This is 1 of 6 questions on the test. We have 50 minutes to complete the entire test. I can do this one, but it takes me like 30-40 minutes each time. Does anyone know any tricks or shortcuts to make something like this go quicker?

Thanks.


Solution

  • There are a few strategies you might use to reduce the time to get to an answer. First, you might try to understand directly the language of the grammar, and rather than converting the given grammar, writing down a new grammar in CNF for the same language. I don't know that this is particularly useful for the example given, but it might be useful, particularly if - on a test - you recognize a grammar as corresponding to a known language, or if the language is named or otherwise described.

    Otherwise, I would look for ways to simplify the grammar before attempting to remove ε. That step increases the size of your grammar, and you want that to be done as late as possible to avoid having to think about a bigger grammar earlier on. This has some nontrivial payoff in this example.

    First, see whether every symbol leads to a string in the language. This should be pretty quick. Eliminate any symbols which do not, as they are not even potentially useful.

    1. If a nonterminal leads to a string of terminals, it is potentially useful.
    2. If a nonterminal leads to a string of terminals and potentially useful nonterminals, it is potentially useful.

    For your grammar:

    • Y -> cba so Y is potentially useful
    • X -> bT -> babc so T is potentially useful
    • W and V do not lead anywhere useful; they only derive strings including themselves or each other, and neither one has proven to be potentially useful. These symbols, and any productions containing them, can be immediately discarded.
    • U -> e so U is potentially useful.
    • T -> abc so T is potentially useful
    • S -> TaXU -> abcaXU -> abcabTU -> abcababcU -> abcababc so S is potentially useful (too bad!)

    This already gained us a lot. Consider the new grammar:

    S → TaXU
    T → UU | abc 
    U → bSc | ε 
    X → bT | Tc 
    Y → cba | ε
    

    Next, look for any nonterminal symbols other than S that don't appear in a remaining production. We can quickly see that Y is not reachable from S in this new grammar and can remove it to get:

    S → TaXU
    T → UU | abc 
    U → bSc | ε 
    X → bT | Tc
    

    It might be possible that the above steps could be usefully repeated to continue removing useless nonterminal symbols and productions. I think everything remaining is useful.

    Now we can eliminate ε. The standard way of doing this is by adding productions that use ε in place of U. We have two productions that use U and three new productions to add. Our new grammar looks like this:

    S → TaXU | TaX
    T → UU | U | abc | ε 
    U → bSc
    X → bT | Tc
    

    Repeating for T:

    S → TaXU | aXU | TaX | aX
    T → UU | U | abc
    U → bSc
    X → bT | Tc | b | c
    

    We only have one production that take one nonterminal to another, and we can eliminate that by substitution:

    S → TaXU | aXU | TaX | aX
    T → UU | bSc | abc
    U → bSc
    X → bT | Tc | b | c
    

    Now, what's left shouldn't take too long - we're past the hardest part, which is eliminating empty productions and productions deriving a single nonterminal. Now, we just need to introduce productions for terminal symbols and for strings of terminals and nonterminals. I recommend you start with shorter strings and work your way up.

    We see some terminal symbols appearing alongside nonterminals, which cannot be, so you can always just add new nonterminals for them:

    S → TAXU | AXU | TAX | AX
    T → UU | BSC | ABC
    U → BSC
    X → BT | TC | b | c
    A → a
    B → b
    C → c
    

    Now, starting with the shortest strings (length > 2), add new symbols that derive two at a time. To save time, just work left to right. We see BSC, ABC, TAX, AXU and TAXU. We can add G → BS, H -> AB, I → TA, J → XU and get:

    S → IJ | AJ | IX | AX
    T → UU | GC | HC
    U → GC
    G → BS
    H → AB
    I → TA
    J → XU
    X → BT | TC | b | c
    A → a
    B → b
    C → c
    

    Now, did this take then 8 minutes you'd have to do the problem (6 problems, 50 minutes) in a testing situation? That's a bit tight. It certainly took me longer to type the explanation. Crossing out symbols and productions should be quick, but adding productions means writing them down, which takes some time.