How do you convert the following ambiguous grammar to unambiguous?

I understand the difference between the two: an ambiguous grammar has at least one string with two or more distinct parse trees, while in an unambiguous grammar every string has exactly one. But I can't seem to convert one to the other.

How would I convert the following ambiguous grammar to an unambiguous one?

S -> aSb
S -> abS
S -> lambda

Edit: Ok, my stab at this would be something like

S -> aSb | lambda
b -> abS | lambda

Any thoughts?

Solution

The grammar is ambiguous not merely because two rules match 'a' as the next token, but because the string 'ab' can be derived by either the first or the second rule (using the third rule, S -> lambda, for the remaining S in each case).

There is such a thing as an inherently ambiguous language (one for which every grammar is ambiguous), but this language isn't one.
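To make the two parse trees for 'ab' concrete, here is a minimal sketch in Python that counts parse trees by brute force. The function name count_trees and the dictionary encoding of the grammar are my own choices for illustration, and the approach assumes that every non-empty production starts with a terminal, which holds for this grammar.

```python
from functools import lru_cache

# Productions from the question; "" stands for lambda.
# Uppercase letters are non-terminals, lowercase letters are terminals.
GRAMMAR = {"S": ["aSb", "abS", ""]}


def count_trees(grammar, start, text):
    """Count the distinct parse trees that derive `text` from `start`.

    Assumes every non-empty production begins with a terminal, so the
    recursion always works on a shorter string and terminates.
    """

    @lru_cache(maxsize=None)
    def derive(symbols, s):
        # Number of ways the symbol sequence `symbols` can derive `s`.
        if not symbols:
            return 1 if not s else 0
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:
            # Terminal: it must match the next input character.
            return derive(rest, s[1:]) if s[:1] == head else 0
        total = 0
        for split in range(len(s) + 1):
            # `head` derives s[:split]; `rest` derives s[split:].
            left = sum(derive(body, s[:split]) for body in grammar[head])
            if left:
                total += left * derive(rest, s[split:])
        return total

    return derive(start, text)


print(count_trees(GRAMMAR, "S", "ab"))  # prints 2: one tree per rule
```

A result greater than 1 for any string is enough to show that the grammar is ambiguous.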

Focusing on this specific example, I started by enumerating the strings that would parse. I numbered the rules 1, 2 and 3 in the order given in the question, and considered all the sequences in which rules 1 and 2 could appear in a derivation (these being the two rules that generate terminals). Each line below pairs two sequences because the last rule applied yields 'ab' either way. N.B. I assumed "lambda" denotes the empty production.

1,2 => ab
11,12 => aabb
21,22 => abab
111,112 => aaabbb
121,122 => aababb
211,212 => abaabb
221,222 => ababab
1111,1112 => aaaabbbb
1121,1122 => aaababbb
1211,1212 => aabaabbb
1221,1222 => aabababb
2111,2112 => abaaabbb
2121,2122 => abaababb
2211,2212 => ababaabb
2221,2222 => abababab
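The strings on the right-hand side of the list above can be reproduced mechanically. The following is a small sketch (the helper name language is mine) that builds the set bottom-up: if s is in the language, then so are a + s + b and ab + s.

```python
def language(max_len):
    """Strings derivable from S -> aSb | abS | lambda, up to max_len characters."""
    found = {""}       # S -> lambda
    frontier = {""}    # strings discovered in the previous round
    while frontier:
        new = set()
        for s in frontier:
            # S -> aSb and S -> abS, applied to an already derived string
            for candidate in ("a" + s + "b", "ab" + s):
                if len(candidate) <= max_len and candidate not in found:
                    new.add(candidate)
        found |= new
        frontier = new
    return found


by_length = {}
for s in sorted(language(8)):
    by_length.setdefault(len(s), []).append(s)
for n in sorted(by_length):
    print(n, by_length[n])
```

For lengths 2, 4, 6 and 8 this yields 1, 2, 4 and 8 strings respectively, matching the list.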

From this exercise, it is clear that every matching string has even length and contains exactly as many 'a' terminals as 'b' terminals (though not every such string matches: 'ba' does not). Further, concatenating two non-empty matching strings only results in another matching string if the prefix was built entirely with the second rule, S -> abS, that is, if the prefix is a run of 'ab' pairs.
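These observations can be spot-checked for short strings. The sketch below is only a bounded check up to an arbitrary length of 12, not a proof, and the helper name language is again my own.

```python
import re
from itertools import product


def language(max_len):
    """Strings derivable from S -> aSb | abS | lambda, up to max_len characters."""
    found, frontier = {""}, {""}
    while frontier:
        frontier = {
            t
            for s in frontier
            for t in ("a" + s + "b", "ab" + s)
            if len(t) <= max_len and t not in found
        }
        found |= frontier
    return found


MAX_LEN = 12
L = language(MAX_LEN)

# Every generated string has as many 'a's as 'b's (hence even length).
assert all(s.count("a") == s.count("b") for s in L)

# The converse fails: "ba" is balanced but is not generated.
assert "ba" not in L

# Concatenating two non-empty members stays in the language exactly when
# the prefix is a run of "ab" pairs, i.e. built with S -> abS alone.
for u, v in product(L - {""}, repeat=2):
    if len(u) + len(v) <= MAX_LEN:
        assert ((u + v) in L) == bool(re.fullmatch(r"(ab)+", u))
```

If any of the assertions failed, the offending string would be a counterexample to the corresponding observation.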

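From this analysis, I drew up some new productions.

S -> a X b
S -> a b S
S -> lambda
X -> a X b
X -> a b S

Here X derives exactly the non-empty strings of S. The only overlap between the original rules was the string 'ab' (S -> lambda inside either rule), so the first rule now insists on a non-empty middle: anything built with a X b starts with 'aa', and anything built with a b S starts with 'ab'.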

This new grammar is unambiguous, yet it matches the same strings as the ambiguous grammar. It achieves this by introducing a new non-terminal, X. When this CFG is used with a push-down automaton, the additional state information arising from distinguishing S from X is sufficient to avoid ambiguity.
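A claim like this can be tested by brute force on short strings. The sketch below compares the original grammar with a candidate rewrite over every string of 'a' and 'b' up to a chosen length. The candidate productions in the code are an assumption on my part (a two-non-terminal rewrite in which X derives exactly the non-empty strings of S), and the function names are hypothetical; swap in whichever productions you want to test. It only works for grammars whose non-empty productions all start with a terminal.

```python
from functools import lru_cache
from itertools import product

# "" stands for lambda; uppercase = non-terminal, lowercase = terminal.
ORIGINAL = {"S": ["aSb", "abS", ""]}
# Assumed candidate rewrite: X derives the non-empty strings of S.
CANDIDATE = {"S": ["aXb", "abS", ""], "X": ["aXb", "abS"]}


def count_trees(grammar, start, text):
    """Count the distinct parse trees that derive `text` from `start`."""

    @lru_cache(maxsize=None)
    def derive(symbols, s):
        if not symbols:
            return 1 if not s else 0
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:
            # Terminal: it must match the next input character.
            return derive(rest, s[1:]) if s[:1] == head else 0
        total = 0
        for split in range(len(s) + 1):
            left = sum(derive(body, s[:split]) for body in grammar[head])
            if left:
                total += left * derive(rest, s[split:])
        return total

    return derive(start, text)


def compare(max_len):
    """Check both grammars on every string over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            before = count_trees(ORIGINAL, "S", s)
            after = count_trees(CANDIDATE, "S", s)
            assert (before > 0) == (after > 0), f"languages differ on {s!r}"
            assert after <= 1, f"candidate is ambiguous on {s!r}"


compare(10)
print("same language, at most one parse tree per string, up to length 10")
```

Passing this check is evidence rather than proof, since it only covers strings up to the chosen length. A failing assertion, on the other hand, names a concrete string on which the two grammars disagree or the candidate is still ambiguous.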

If this problem arose in the context of using something like Yacc or Bison, the ambiguity is often an indication that the choice of terminal tokens could be improved. Here the choice between the rules hinges on whether the input starts with 'aa' or 'ab', so a tokenizer that handed the parser two-character tokens would let it pick a rule from a single token. That does not extend cleanly to the whole input, though: a string such as aababb splits into 'aa', 'ba', 'bb', so 'aa', 'ab' and 'bb' alone would not cover this language. When using (F)lex as a tokenizer, as a rule of thumb, it's a good idea to make the tokens it matches as big as makes sense, as it's quicker (in theory at least) to match a regular expression than a context-free grammar.