Dragon book exercise 4.2.3 f: Grammar for strings whose two halfs differ

I can solve 4.2.3 a~e by myself, but f is too hard for me and I can't even find the answer using google.

Design grammars for the following languages:

(f) The set of all strings of 0s and 1s of the form xy, where x ≠ y and x and y are of the same length

Solution

The interesting thing about this language is its relationship with

L_dup = { ωω | ω ∈ Σ^*}

which is the classic example of a language which is not context-free. But the language from the exercise

L_diff = { γζ | γ ≠ ζ ∧ γ ∈ Σ^* ∧ ζ ∈ Σ^* }

has the property that

L_diff ⋃ L_dup ⋃ L_odd = Σ^* and L_dup, L_diff, L_odd are disjoint.

where

L_odd = { ω | |ω| is odd ∧ ω ∈ Σ^*}

In other words, if you split a string into two equal-length parts, the two parts are either equal or unequal, and if you can't split the string that way, its length is odd.

L_odd is clearly context free -- in fact, it is regular -- so the fact that L_diff is context-free leads to the conclusion that the complement of a context-free language can be not context-free.

Given that, it's natural to think about L_diff as being composed of the two equal-length pieces γ and ζ. But that frames the problem in a way that makes it difficult, if not impossible, to solve. As with many maths problems (and, I dare say, real-world problems), the solution has to start by reframing the problem to reveal its underlying simplicity. ("Thinking outside of the box", to use the famous cliché.)

Let's say that γ and ζ are both of length n. Since they are different, they must differ in at least one position. Let's say that k is such a position, so that γ_k ≠ ζ_k.

Now, let's try to characterise the pairs of strings which satisfy that constraint. All of the symbols other than the ones at position k are irrelevant, so what we end up with is:

Σ^kγ_kΣ^n-k-1 Σ^kζ_kΣ^n-k-1

That's obviously the same as

Σ^kγ_kΣ^k Σ^n-k-1ζ_kΣ^n-k-1

and letting j = n−k−1, a = γ_k and b = ζ_k, we get

Σ^kaΣ^k Σ^jbΣ^j, a ≠ b

j and k and the two symbols a and b are all arbitrary in that expression. So in effect, we've described the set of strings which can be divided into two odd-length parts such that the symbols in the middle of the two parts differ. With binary strings, there are only two pairs of symbols which differ: <0, 1> and <1, 0>. So that leads to the following grammar:

W → 0 | 1               /* Wildcard */
A → 0 | W A W           /* Strings with a 0 in the middle */
B → 1 | W B W           /* Strings with a 1 in the middle */
S → A B | B A           /* L_diff */