grammar, context-free-grammar, parse-tree

How do you convert the following ambiguous grammar to unambiguous?


I understand the difference between the two: a grammar is ambiguous if at least one string has two or more distinct parse trees, while in an unambiguous grammar every string has exactly one. But I can't seem to convert one to the other.

How would I convert the following ambiguous grammar to an unambiguous one?

S -> aSb
S -> abS
S -> lambda

Edit: Ok, my stab at this would be something like

S -> aSb | lambda
b -> abS | lambda

Any thoughts?


Solution

  • The grammar is ambiguous not merely because two rules match 'a' as the next token, but because 'ab' can be derived by either the first or the second rule (substituting the third rule, lambda, for S in each).

    There is such a thing as an inherently ambiguous language (one for which every grammar is ambiguous), but this isn't one.
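
    To make the two parses of 'ab' concrete, here is a minimal sketch in Python that enumerates every parse of a string under the original three rules. The function name and the nested-tuple representation of a parse are assumptions of this sketch, not part of the question.

```python
def derivations(s: str):
    """Yield every parse of s under S -> aSb | abS | lambda.

    A parse is a nested tuple naming the rule applied at each step,
    ending in the string "lambda" (hypothetical representation).
    """
    if s == "":
        yield "lambda"                        # S -> lambda
    if len(s) >= 2 and s[0] == "a" and s[-1] == "b":
        for inner in derivations(s[1:-1]):    # S -> a S b
            yield ("aSb", inner)
    if s.startswith("ab"):
        for inner in derivations(s[2:]):      # S -> a b S
            yield ("abS", inner)


for parse in derivations("ab"):
    print(parse)
```

    Two parses are printed for 'ab', one through each rule, which is exactly the ambiguity described above. A string outside the language, such as 'ba', yields no parse at all.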

    Focusing on this specific example, I started by enumerating the strings that would parse. I numbered the rules 1, 2 and 3, and considered all the sequences in which rules 1 and 2 could appear in the parse (these being the two rules that generate terminals). N.B. I assumed "lambda" denoted the empty production.

    1,2 => ab
    11,12 => abab
    21,22 => aabb
    111,112 => ababab
    121,122 => abaabb
    211,212 => aababb
    221,222 => aaabbb
    1111,1112 => abababab
    1121,1122 => ababaabb
    1211,1212 => abaababb
    1221,1222 => abaaabbb
    2111,2112 => aabababb
    2121,2122 => aabaabbb
    2211,2212 => aaababbb
    2221,2222 => aaaabbbb
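
    The listing above can be regenerated mechanically. The sketch below assumes, as the rows themselves imply, that digit 1 denotes S -> abS and digit 2 denotes S -> aSb, that rules are applied outermost-first, and that every sequence is closed off with the lambda rule. The helper names expand and table are hypothetical.

```python
from collections import defaultdict
from itertools import product


def expand(rules) -> str:
    """Apply a sequence of rule digits outermost-first, then finish with lambda."""
    if not rules:
        return ""                              # S -> lambda
    inner = expand(rules[1:])
    if rules[0] == "1":
        return "ab" + inner                    # assumed: 1 is S -> a b S
    return "a" + inner + "b"                   # assumed: 2 is S -> a S b


def table(max_rules: int = 4):
    """Group rule sequences by the string they derive."""
    rows = defaultdict(list)
    for n in range(1, max_rules + 1):
        for seq in product("12", repeat=n):
            rows[expand(seq)].append("".join(seq))
    return rows


for string, seqs in table().items():
    print(",".join(seqs), "=>", string)
```

    Each string is reached by two rule sequences that differ only in their last digit, because the innermost 'ab' can come from either rule.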
    

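    From this exercise, it is clear that every matched string has even length, with the number of 'a' terminals exactly matching the number of 'b' terminals. That is a necessary condition rather than a sufficient one: 'ba' meets it but is not matched. Further, concatenating two matching strings only results in another matching string if the prefix was matched using only the S -> abS rule. For example, 'ab' followed by 'aabb' is matched, but 'aabb' followed by 'ab' is not.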

    From this analysis, I drew up some new productions.

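    S -> a X b
    S -> a b S
    S -> lambda
    X -> a X b
    X -> a b S
    

    This new grammar is not ambiguous, but it matches the same strings as the ambiguous grammar. It achieves this by introducing a new non-terminal X that derives exactly the non-empty strings of the language. Because X can never be empty, S -> a X b can no longer produce 'ab', which leaves S -> a b S as the only way to derive it. Anything derived through a X b starts with 'aa', while anything derived through a b S starts with 'ab', so the two rules never compete for the same string. When this CFG is used with a push-down automaton, the additional state information arising from using both S and X is sufficient to avoid ambiguity.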
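
    A candidate rewrite can be sanity-checked by brute force: count the parse trees of every string up to some length under both grammars, then confirm that the two grammars accept the same strings and that the candidate never yields more than one tree. The sketch below is a bounded check, not a proof. The CANDIDATE table is an assumption of the sketch (a rewrite in which X derives only non-empty strings), and the counter assumes no left recursion, so that every recursive call works on a shorter string.

```python
from functools import lru_cache
from itertools import product

AMBIGUOUS = {"S": [("a", "S", "b"), ("a", "b", "S"), ()]}
CANDIDATE = {   # assumed rewrite: X derives the non-empty strings only
    "S": [("a", "X", "b"), ("a", "b", "S"), ()],
    "X": [("a", "X", "b"), ("a", "b", "S")],
}


def make_counter(grammar):
    """Return count(symbols, s): number of parse forests deriving s from symbols."""
    @lru_cache(maxsize=None)
    def count(symbols: tuple, s: str) -> int:
        if not symbols:
            return 1 if s == "" else 0
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:                # terminal: must match next char
            return count(rest, s[1:]) if s[:1] == head else 0
        total = 0
        for i in range(len(s) + 1):            # head derives s[:i], rest derives s[i:]
            left = sum(count(rhs, s[:i]) for rhs in grammar[head])
            if left:
                total += left * count(rest, s[i:])
        return total
    return count


def check(max_len: int = 10) -> None:
    old, new = make_counter(AMBIGUOUS), make_counter(CANDIDATE)
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            before, after = old(("S",), s), new(("S",), s)
            assert (before > 0) == (after > 0), f"languages differ on {s!r}"
            assert after <= 1, f"{s!r} has {after} parse trees"


check()
```

    If either assertion fails, the message names the offending string, which is usually enough to see which production needs adjusting.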

    If this problem arose in the context of using something like Yacc or Bison, the ambiguity is often an indication that you've made a poor choice of terminal tokens. When using (F)lex as a tokenizer, as a rule of thumb, it's a good idea to make the tokens it matches as big as makes sense, as it's quicker to match a regular expression (in theory at least) than a context-free grammar. Treat that as a heuristic rather than a fix for this particular grammar, though: fixed two-character tokens such as 'aa', 'ab' and 'bb' cannot cover a string like aababb, which would split as aa, ba, bb.