Search code examples
compiler-constructioncomputer-science

Dragon book exercise 4.2.3 f: Grammar for strings whose two halfs differ


I can solve 4.2.3 a~e by myself, but f is too hard for me and I can't even find the answer using google.

Design grammars for the following languages:

(f) The set of all strings of 0s and 1s of the form xy, where x ≠ y and x and y are of the same length


Solution

  • The interesting thing about this language is its relationship with

    Ldup = { ωω | ω ∈ Σ*}

    which is the classic example of a language which is not context-free. But the language from the exercise

    Ldiff = { γζ | γ ≠ ζ ∧ γ ∈ Σ* ∧ ζ ∈ Σ* }

    has the property that

    Ldiff ⋃ Ldup ⋃ Lodd = Σ* and Ldup, Ldiff, Lodd are disjoint.

    where

    Lodd = { ω | |ω| is odd ∧ ω ∈ Σ*}

    In other words, if you split a string into two equal-length parts, the two parts are either equal or unequal, and if you can't split the string that way, its length is odd.

    Lodd is clearly context free -- in fact, it is regular -- so the fact that Ldiff is context-free leads to the conclusion that the complement of a context-free language can be not context-free.

    Given that, it's natural to think about Ldiff as being composed of the two equal-length pieces γ and ζ. But that frames the problem in a way that makes it difficult, if not impossible, to solve. As with many maths problems (and, I dare say, real-world problems), the solution has to start by reframing the problem to reveal its underlying simplicity. ("Thinking outside of the box", to use the famous cliché.)

    Let's say that γ and ζ are both of length n. Since they are different, they must differ in at least one position. Let's say that k is such a position, so that γkζk.

    Now, let's try to characterise the pairs of strings which satisfy that constraint. All of the symbols other than the ones at position k are irrelevant, so what we end up with is:

    ΣkγkΣn-k-1 ΣkζkΣn-k-1

    That's obviously the same as

    ΣkγkΣk Σn-k-1ζkΣn-k-1

    and letting j = nk−1, a = γk and b = ζk, we get

    Σkk Σjj, a ≠ b

    j and k and the two symbols a and b are all arbitrary in that expression. So in effect, we've described the set of strings which can be divided into two odd-length parts such that the symbols in the middle of the two parts differ. With binary strings, there are only two pairs of symbols which differ: <0, 1> and <1, 0>. So that leads to the following grammar:

    W → 0 | 1               /* Wildcard */
    A → 0 | W A W           /* Strings with a 0 in the middle */
    B → 1 | W B W           /* Strings with a 1 in the middle */
    S → A B | B A           /* Ldiff */