Search code examples
parsingcompiler-constructionbnf

Looking for advice on making this BNF grammar suitable for LL(1) parsing (left factoring)


I am working on a parsing project that uses an adaption of this grammar for Perl's regular expressions http://www.cs.sfu.ca/~cameron/Teaching/384/99-3/regexp-plg.html. I have simplified this grammar for my own purposes, like so (note that, because "|" is a token, I am instead using a comma "," so seperate productions for a given nonterminal):

<RE>      := <union>, <simple>
<union>   := <RE> | <simple>
<simple>  := <concat>, <basic>
<concat>  := <simple><basic>
<basic>   := <zero>, <one>, <onezero>, <elem>
<zero>    := <elem>*
<one>     := <elem>+
<onezero> := <elem>?
<elem>    := <group>, <any>, <literal>, <class>
<group>   := (<RE>)
<class>   := [<items>]
<items>   := <item>, <item><items>
<item>    := <range>, <literal>

I want to write a LL(1) parser to handle this grammar, and for an LL(1) parser the productions for <items> have some ambiguity. To fix this, I could left-factor them by adding a new nonterminal <X>, like so:

<items>   :=  <item><X>
<X>       :=  <items>, epsilon

But what I'm wondering is, could I just flip around the order of the second production in <items>, like this:

<items>   := <item>, <items><item>

and avoid adding a new nonterminal? It doesn't look like it breaks anything, after all the whole point of this production is to allow any variable number of sequential <item> symbols, and we still get that if we reverse the order. Am I missing something, or does simply reversing the order achieve the same goal as left-factoring in this situation?


Solution

  • The problem you are trying to fix is that

    items → item
    items → item items
    

    is not left-factored; both productions start with item.

    Your proposed fix

    items → item
    items → items item
    

    doesn't really help (whatever starts item can still start either production of items), but more importantly, it is left-recursive, which is verboten for LL grammars.

    In principle, the "new non-terminal" is the correct solution, but in a recursive descent parser, you would probably do something like this:

    def items():
      list = [ item() ]
      while couldStart(item, lookahead):
        list.append(item())
      return list