I am working on a parsing project that uses an adaption of this grammar for Perl's regular expressions http://www.cs.sfu.ca/~cameron/Teaching/384/99-3/regexp-plg.html. I have simplified this grammar for my own purposes, like so (note that, because "|" is a token, I am instead using a comma "," so seperate productions for a given nonterminal):
<RE> := <union>, <simple>
<union> := <RE> | <simple>
<simple> := <concat>, <basic>
<concat> := <simple><basic>
<basic> := <zero>, <one>, <onezero>, <elem>
<zero> := <elem>*
<one> := <elem>+
<onezero> := <elem>?
<elem> := <group>, <any>, <literal>, <class>
<group> := (<RE>)
<class> := [<items>]
<items> := <item>, <item><items>
<item> := <range>, <literal>
I want to write a LL(1) parser to handle this grammar, and for an LL(1) parser the productions for <items>
have some ambiguity. To fix this, I could left-factor them by adding a new nonterminal <X>
, like so:
<items> := <item><X>
<X> := <items>, epsilon
But what I'm wondering is, could I just flip around the order of the second production in <items>
, like this:
<items> := <item>, <items><item>
and avoid adding a new nonterminal? It doesn't look like it breaks anything, after all the whole point of this production is to allow any variable number of sequential <item>
symbols, and we still get that if we reverse the order. Am I missing something, or does simply reversing the order achieve the same goal as left-factoring in this situation?
The problem you are trying to fix is that
items → item
items → item items
is not left-factored; both productions start with item
.
Your proposed fix
items → item
items → items item
doesn't really help (whatever starts item
can still start either production of items
), but more importantly, it is left-recursive, which is verboten for LL grammars.
In principle, the "new non-terminal" is the correct solution, but in a recursive descent parser, you would probably do something like this:
def items():
list = [ item() ]
while couldStart(item, lookahead):
list.append(item())
return list