I'm using PLY to parse this grammar. I implemented a metagrammar for EBNF used in the linked spec, but PLY reports multiple shift/reduce conflicts.
Grammar:
Rule 0 S' -> grammar
Rule 1 grammar -> prod_list
Rule 2 grammar -> empty
Rule 3 prod_list -> prod
Rule 4 prod_list -> prod prod_list
Rule 5 prod -> id : : = rule_list
Rule 6 rule_list -> rule
Rule 7 rule_list -> rule rule_list
Rule 8 rule -> rule_simple
Rule 9 rule -> rule_group
Rule 10 rule -> rule_opt
Rule 11 rule -> rule_rep0
Rule 12 rule -> rule_rep1
Rule 13 rule -> rule_alt
Rule 14 rule -> rule_except
Rule 15 rule_simple -> terminal
Rule 16 rule_simple -> id
Rule 17 rule_simple -> char_range
Rule 18 rule_group -> ( rule_list )
Rule 19 rule_opt -> rule_simple ?
Rule 20 rule_opt -> rule_group ?
Rule 21 rule_rep0 -> rule_simple *
Rule 22 rule_rep0 -> rule_group *
Rule 23 rule_rep1 -> rule_simple +
Rule 24 rule_rep1 -> rule_group +
Rule 25 rule_alt -> rule | rule
Rule 26 rule_except -> rule - rule_simple
Rule 27 rule_except -> rule - rule_group
Rule 28 terminal -> SQ string_no_sq SQ
Rule 29 terminal -> DQ string_no_dq DQ
Rule 30 string_no_sq -> LETTER string_no_sq
Rule 31 string_no_sq -> DIGIT string_no_sq
Rule 32 string_no_sq -> SYMBOL string_no_sq
Rule 33 string_no_sq -> DQ string_no_sq
Rule 34 string_no_sq -> + string_no_sq
Rule 35 string_no_sq -> * string_no_sq
Rule 36 string_no_sq -> ( string_no_sq
Rule 37 string_no_sq -> ) string_no_sq
Rule 38 string_no_sq -> ? string_no_sq
Rule 39 string_no_sq -> | string_no_sq
Rule 40 string_no_sq -> [ string_no_sq
Rule 41 string_no_sq -> ] string_no_sq
Rule 42 string_no_sq -> - string_no_sq
Rule 43 string_no_sq -> : string_no_sq
Rule 44 string_no_sq -> = string_no_sq
Rule 45 string_no_sq -> empty
Rule 46 string_no_dq -> LETTER string_no_dq
Rule 47 string_no_dq -> DIGIT string_no_dq
Rule 48 string_no_dq -> SYMBOL string_no_dq
Rule 49 string_no_dq -> SQ string_no_dq
Rule 50 string_no_dq -> + string_no_dq
Rule 51 string_no_dq -> * string_no_dq
Rule 52 string_no_dq -> ( string_no_dq
Rule 53 string_no_dq -> ) string_no_dq
Rule 54 string_no_dq -> ? string_no_dq
Rule 55 string_no_dq -> | string_no_dq
Rule 56 string_no_dq -> [ string_no_dq
Rule 57 string_no_dq -> ] string_no_dq
Rule 58 string_no_dq -> - string_no_dq
Rule 59 string_no_dq -> : string_no_dq
Rule 60 string_no_dq -> = string_no_dq
Rule 61 string_no_dq -> empty
Rule 62 id -> LETTER LETTER id
Rule 63 id -> LETTER DIGIT id
Rule 64 id -> LETTER
Rule 65 id -> DIGIT
Rule 66 rest_of_id -> LETTER rest_of_id
Rule 67 rest_of_id -> DIGIT rest_of_id
Rule 68 rest_of_id -> empty
Rule 69 char_range -> [ UNI_CH - UNI_CH ]
Rule 70 empty -> <empty>
Conflicts:
id : LETTER LETTER id
| LETTER DIGIT id
| LETTER
| DIGIT
.
state 4
(62) id -> LETTER . LETTER id
(63) id -> LETTER . DIGIT id
(64) id -> LETTER .
! shift/reduce conflict for LETTER resolved as shift
! shift/reduce conflict for DIGIT resolved as shift
LETTER shift and go to state 10
DIGIT shift and go to state 9
| reduce using rule 64 (id -> LETTER .)
- reduce using rule 64 (id -> LETTER .)
( reduce using rule 64 (id -> LETTER .)
SQ reduce using rule 64 (id -> LETTER .)
DQ reduce using rule 64 (id -> LETTER .)
[ reduce using rule 64 (id -> LETTER .)
$end reduce using rule 64 (id -> LETTER .)
) reduce using rule 64 (id -> LETTER .)
: reduce using rule 64 (id -> LETTER .)
? reduce using rule 64 (id -> LETTER .)
* reduce using rule 64 (id -> LETTER .)
+ reduce using rule 64 (id -> LETTER .)
! LETTER [ reduce using rule 64 (id -> LETTER .) ]
! DIGIT [ reduce using rule 64 (id -> LETTER .) ]
The id
rule is supposed to guarantee that productions' ids start with a letter.
Next conflict:
rule_alt : rule '|' rule
.
state 113
(25) rule_alt -> rule | rule .
(25) rule_alt -> rule . | rule
(26) rule_except -> rule . - rule_simple
(27) rule_except -> rule . - rule_group
! shift/reduce conflict for | resolved as shift
! shift/reduce conflict for - resolved as shift
( reduce using rule 25 (rule_alt -> rule | rule .)
SQ reduce using rule 25 (rule_alt -> rule | rule .)
DQ reduce using rule 25 (rule_alt -> rule | rule .)
LETTER reduce using rule 25 (rule_alt -> rule | rule .)
DIGIT reduce using rule 25 (rule_alt -> rule | rule .)
[ reduce using rule 25 (rule_alt -> rule | rule .)
) reduce using rule 25 (rule_alt -> rule | rule .)
$end reduce using rule 25 (rule_alt -> rule | rule .)
| shift and go to state 76
- shift and go to state 74
! | [ reduce using rule 25 (rule_alt -> rule | rule .) ]
! - [ reduce using rule 25 (rule_alt -> rule | rule .) ]
Connected to a smiliar one:
rule_except : rule '-' rule_simple
| rule '-' rule_group
How do I fix these?
You really should think seriously about using the usual scanner/parser architecture. Otherwise, you will have to find a way to deal with whitespace.
As it is, you seem to be ignoring whitespace altogether. That means that the parser cannot see the whitespace between three consecutive identifiers
. It will see them run together as asoupofundifferentiatedletters
, and it has no way to know what the original intent was. This makes your grammar deeply ambiguous, because in the grammar two identifiers can follow each other on the assumption that something will cause them to be differentiated from each other. And ambiguous grammars always result in LR conflicts.
Having the identifiers (and other multi-character tokens) recognized by the lexer is much easier. Otherwise, you will have to rewrite your grammar to identify all the places where whitespace is allowed (such as around the punctuation in (identifer1|identifier2)
) or required (such as two identifiers
).
Identifying identifiers in the scanner using regular expressions will also remove the other problems with your grammar and identifiers:
id -> LETTER LETTER id
id -> LETTER DIGIT id
id -> LETTER
These rules require id
to be an odd number of characters, where the digits only appear in even positions. So a1b
would be an id
, but not ab1
or ab
or a1
. I'm sure that's not what you meant.
You seem to be trying to avoid left-recursion. Instead, you should embrace left-recursion. Bottom-up parsers, like PLY, love left-recursion. (They handle right-recursion, but at the cost of excessive parser stack usage.) So what you really want is:
id: LETTER | id LETTER | id DIGIT
There are other places in the grammar where similar changes are necessary.
The other conflict is caused by your unorthodox handling of operator precedence, which might also be a result of your attempt to avoid left-recursion. The EBNF operators can be parsed with a simple precedence scheme, as with algebraic operators. However, the use of precedence declarations (%left
and friends) will be complicated because of the "invisible" concatenation operator. Generally, you'll find it easier to use explicit precedence as in the standard expr
/factor
/term
algebraic grammar. In your case, the equivalent would be something like:
item: id
| terminal
| '(' rule ')'
term: item
| item '*'
| item '+'
| item '?'
seq : term
| seq term
alt : seq
| alt '|' seq
except: term '-' term
rule: alt
| except
The handling of except
in the above corresponds to the lack of information about the precedence of the -
operator. That's expressed by effectively disallowing any mix of -
and |
operators without explicit parentheses.
You will also find that you have a shift/reduce conflict here:
# The following will create a problem
prod: id "::=" rule
prod_list
: prod
| prod_list prod
(NOTE: the fact that I wrote that with left-recursion does not create the problem.)
That is not ambiguous, but it is not left-to-right parseable with a single lookahead token. It requires two tokens, because you cannot know whether or not the id
is part of the currently-being-parsed sequence, or the beginning of a new production until you see the token after the id
: if it is ::=
, then the id
was the start of a new production and should not be shifted into the current rule. The usual solution to that problem is a hack in the lexer: the lexer is wrapped by a function which keeps one extra token of lookahead, so that it can emit id ::=
as a single token of type definition
. There are a number of examples of this hack for various LR parsers in other SO questions.
Having said all of that, I really don't understand why you want to build a parser for EBNF in order to parse XML. Building a working parser from EBNF is basically what PLY does, except that it doesn't implement the "E" part, so you have to rewrite rules which use the ?
, *
, +
and -
operators. This can be handled automatically, although the -
operator is non-trivial in general, but it is not going to be simple. It would be easier, IMHO, to rewrite the few EBNF rules into BNF and then just use PLY. But if you are looking for a challenge, go for it.