Search code examples
parsingyaccbnfebnfshift-reduce-conflict

Solving shift/reduce conflicts


I'm using PLY to parse this grammar. I implemented a metagrammar for EBNF used in the linked spec, but PLY reports multiple shift/reduce conflicts.

Grammar:

Rule 0     S' -> grammar
Rule 1     grammar -> prod_list
Rule 2     grammar -> empty
Rule 3     prod_list -> prod
Rule 4     prod_list -> prod prod_list
Rule 5     prod -> id : : = rule_list
Rule 6     rule_list -> rule
Rule 7     rule_list -> rule rule_list
Rule 8     rule -> rule_simple
Rule 9     rule -> rule_group
Rule 10    rule -> rule_opt
Rule 11    rule -> rule_rep0
Rule 12    rule -> rule_rep1
Rule 13    rule -> rule_alt
Rule 14    rule -> rule_except
Rule 15    rule_simple -> terminal
Rule 16    rule_simple -> id
Rule 17    rule_simple -> char_range
Rule 18    rule_group -> ( rule_list )
Rule 19    rule_opt -> rule_simple ?
Rule 20    rule_opt -> rule_group ?
Rule 21    rule_rep0 -> rule_simple *
Rule 22    rule_rep0 -> rule_group *
Rule 23    rule_rep1 -> rule_simple +
Rule 24    rule_rep1 -> rule_group +
Rule 25    rule_alt -> rule | rule
Rule 26    rule_except -> rule - rule_simple
Rule 27    rule_except -> rule - rule_group
Rule 28    terminal -> SQ string_no_sq SQ
Rule 29    terminal -> DQ string_no_dq DQ
Rule 30    string_no_sq -> LETTER string_no_sq
Rule 31    string_no_sq -> DIGIT string_no_sq
Rule 32    string_no_sq -> SYMBOL string_no_sq
Rule 33    string_no_sq -> DQ string_no_sq
Rule 34    string_no_sq -> + string_no_sq
Rule 35    string_no_sq -> * string_no_sq
Rule 36    string_no_sq -> ( string_no_sq
Rule 37    string_no_sq -> ) string_no_sq
Rule 38    string_no_sq -> ? string_no_sq
Rule 39    string_no_sq -> | string_no_sq
Rule 40    string_no_sq -> [ string_no_sq
Rule 41    string_no_sq -> ] string_no_sq
Rule 42    string_no_sq -> - string_no_sq
Rule 43    string_no_sq -> : string_no_sq
Rule 44    string_no_sq -> = string_no_sq
Rule 45    string_no_sq -> empty
Rule 46    string_no_dq -> LETTER string_no_dq
Rule 47    string_no_dq -> DIGIT string_no_dq
Rule 48    string_no_dq -> SYMBOL string_no_dq
Rule 49    string_no_dq -> SQ string_no_dq
Rule 50    string_no_dq -> + string_no_dq
Rule 51    string_no_dq -> * string_no_dq
Rule 52    string_no_dq -> ( string_no_dq
Rule 53    string_no_dq -> ) string_no_dq
Rule 54    string_no_dq -> ? string_no_dq
Rule 55    string_no_dq -> | string_no_dq
Rule 56    string_no_dq -> [ string_no_dq
Rule 57    string_no_dq -> ] string_no_dq
Rule 58    string_no_dq -> - string_no_dq
Rule 59    string_no_dq -> : string_no_dq
Rule 60    string_no_dq -> = string_no_dq
Rule 61    string_no_dq -> empty
Rule 62    id -> LETTER LETTER id
Rule 63    id -> LETTER DIGIT id
Rule 64    id -> LETTER
Rule 65    id -> DIGIT
Rule 66    rest_of_id -> LETTER rest_of_id
Rule 67    rest_of_id -> DIGIT rest_of_id
Rule 68    rest_of_id -> empty
Rule 69    char_range -> [ UNI_CH - UNI_CH ]
Rule 70    empty -> <empty>

Conflicts:

id  : LETTER LETTER id
            | LETTER DIGIT id
            | LETTER
            | DIGIT

.

state 4

    (62) id -> LETTER . LETTER id
    (63) id -> LETTER . DIGIT id
    (64) id -> LETTER .

  ! shift/reduce conflict for LETTER resolved as shift
  ! shift/reduce conflict for DIGIT resolved as shift
    LETTER          shift and go to state 10
    DIGIT           shift and go to state 9
    |               reduce using rule 64 (id -> LETTER .)
    -               reduce using rule 64 (id -> LETTER .)
    (               reduce using rule 64 (id -> LETTER .)
    SQ              reduce using rule 64 (id -> LETTER .)
    DQ              reduce using rule 64 (id -> LETTER .)
    [               reduce using rule 64 (id -> LETTER .)
    $end            reduce using rule 64 (id -> LETTER .)
    )               reduce using rule 64 (id -> LETTER .)
    :               reduce using rule 64 (id -> LETTER .)
    ?               reduce using rule 64 (id -> LETTER .)
    *               reduce using rule 64 (id -> LETTER .)
    +               reduce using rule 64 (id -> LETTER .)

  ! LETTER          [ reduce using rule 64 (id -> LETTER .) ]
  ! DIGIT           [ reduce using rule 64 (id -> LETTER .) ]

The id rule is supposed to guarantee that productions' ids start with a letter.

Next conflict:

    rule_alt        : rule '|' rule

.

state 113

    (25) rule_alt -> rule | rule .
    (25) rule_alt -> rule . | rule
    (26) rule_except -> rule . - rule_simple
    (27) rule_except -> rule . - rule_group

  ! shift/reduce conflict for | resolved as shift
  ! shift/reduce conflict for - resolved as shift
    (               reduce using rule 25 (rule_alt -> rule | rule .)
    SQ              reduce using rule 25 (rule_alt -> rule | rule .)
    DQ              reduce using rule 25 (rule_alt -> rule | rule .)
    LETTER          reduce using rule 25 (rule_alt -> rule | rule .)
    DIGIT           reduce using rule 25 (rule_alt -> rule | rule .)
    [               reduce using rule 25 (rule_alt -> rule | rule .)
    )               reduce using rule 25 (rule_alt -> rule | rule .)
    $end            reduce using rule 25 (rule_alt -> rule | rule .)
    |               shift and go to state 76
    -               shift and go to state 74

  ! |               [ reduce using rule 25 (rule_alt -> rule | rule .) ]
  ! -               [ reduce using rule 25 (rule_alt -> rule | rule .) ]

Connected to a smiliar one:

rule_except     : rule '-' rule_simple
                | rule '-' rule_group

How do I fix these?


Solution

  • You really should think seriously about using the usual scanner/parser architecture. Otherwise, you will have to find a way to deal with whitespace.

    As it is, you seem to be ignoring whitespace altogether. That means that the parser cannot see the whitespace between three consecutive identifiers. It will see them run together as asoupofundifferentiatedletters, and it has no way to know what the original intent was. This makes your grammar deeply ambiguous, because in the grammar two identifiers can follow each other on the assumption that something will cause them to be differentiated from each other. And ambiguous grammars always result in LR conflicts.

    Having the identifiers (and other multi-character tokens) recognized by the lexer is much easier. Otherwise, you will have to rewrite your grammar to identify all the places where whitespace is allowed (such as around the punctuation in (identifer1|identifier2)) or required (such as two identifiers).

    Identifying identifiers in the scanner using regular expressions will also remove the other problems with your grammar and identifiers:

    id -> LETTER LETTER id
    id -> LETTER DIGIT id
    id -> LETTER
    

    These rules require id to be an odd number of characters, where the digits only appear in even positions. So a1b would be an id, but not ab1 or ab or a1. I'm sure that's not what you meant.

    You seem to be trying to avoid left-recursion. Instead, you should embrace left-recursion. Bottom-up parsers, like PLY, love left-recursion. (They handle right-recursion, but at the cost of excessive parser stack usage.) So what you really want is:

    id: LETTER | id LETTER | id DIGIT
    

    There are other places in the grammar where similar changes are necessary.

    The other conflict is caused by your unorthodox handling of operator precedence, which might also be a result of your attempt to avoid left-recursion. The EBNF operators can be parsed with a simple precedence scheme, as with algebraic operators. However, the use of precedence declarations (%left and friends) will be complicated because of the "invisible" concatenation operator. Generally, you'll find it easier to use explicit precedence as in the standard expr/factor/term algebraic grammar. In your case, the equivalent would be something like:

    item: id
        | terminal
        | '(' rule ')'
    term: item
        | item '*'
        | item '+'
        | item '?'
    seq : term
        | seq term
    alt : seq
        | alt '|' seq
    except: term '-' term
    rule: alt
        | except
    

    The handling of except in the above corresponds to the lack of information about the precedence of the - operator. That's expressed by effectively disallowing any mix of - and | operators without explicit parentheses.

    You will also find that you have a shift/reduce conflict here:

    # The following will create a problem
    prod: id "::=" rule
    prod_list
        : prod
        | prod_list prod
    

    (NOTE: the fact that I wrote that with left-recursion does not create the problem.)

    That is not ambiguous, but it is not left-to-right parseable with a single lookahead token. It requires two tokens, because you cannot know whether or not the id is part of the currently-being-parsed sequence, or the beginning of a new production until you see the token after the id: if it is ::=, then the id was the start of a new production and should not be shifted into the current rule. The usual solution to that problem is a hack in the lexer: the lexer is wrapped by a function which keeps one extra token of lookahead, so that it can emit id ::= as a single token of type definition. There are a number of examples of this hack for various LR parsers in other SO questions.


    Having said all of that, I really don't understand why you want to build a parser for EBNF in order to parse XML. Building a working parser from EBNF is basically what PLY does, except that it doesn't implement the "E" part, so you have to rewrite rules which use the ?, *, + and - operators. This can be handled automatically, although the - operator is non-trivial in general, but it is not going to be simple. It would be easier, IMHO, to rewrite the few EBNF rules into BNF and then just use PLY. But if you are looking for a challenge, go for it.