Search code examples
parsingcontext-free-grammar

How can left-factoring eliminate backtracking while there are non-terminals?


I'm trying to implement a parser generator (for fun), following the book Engineering a Compiler. The book suggests left-factoring for eliminating backtracking:

A -> B m | B n | m
B -> m

Becomes:

A -> B A' | m
B -> m
A' -> m | n

But now, B starts with m, A' starts with m, and A has a production alternative that also starts with m, hence the start sets for A, B and A' have the common element m, disqualifying the grammar for a LL(1) parser.

Is the grammar above inherently a non-backtrack free grammar and requires a more powerful parser such as a LR(1) parser? Or am I applying the left-factoring wrong? or there needs to be a prior step to left factoring so that no rule starts with a non-terminal (is it even possible?)

I feel the book is missing some description, in lot of places and this is one example.


In the above grammar, I simplified the issue I'm facing with the following toy grammar:

start           -> fn_declaration start | fn_declaration |
fn_call         -> ID ( args ) ;
args            -> args , arg | arg |
arg             -> STR | INT | ID
fn_declaration  -> FN ID ( params ) { statements }
params          -> param , params | param |
param           -> ID ID
statements      -> statement , statements | statement |
statement       -> declaration | assignment | fn_call | ret
declaration     -> ID ID ;
assignment      -> ID = expressions ;
expressions     -> terms + expressions | terms - expressions | terms
terms           -> factor * terms | factor / terms | factor
factor          -> ( expressions ) | INT | ID
ret             -> RETURN expressions ;

To which I'm applying these steps, in this order:

  • Eliminate epsilons.
  • Eliminate left recursion.
  • Eliminate backtracking (by left-factoring).
  • Calculate first and follow sets.

All steps do their job, but I eventually have these rules:

statement => alt=[ declaration | assignment | ret | Id ( statement_p0 ], first=[Id, return, ;], follow=[,, }]
statement_p0 => alts=[ args ) ; | ) ;], first=[), Int, Str, Id], follow=[,, }]

# ...rest of the rules removed for brevity

declaration and assignment share ID as their first token kind, and fail the is-backtrack-free test.


Solution

  • In order to left-factor the grammar, you need to ensure that every symbol that begins the RHS of a rule for some non-terminal has a disjoint FIRST set from every other symbol. The easiest way to do that is to first left-expand the grammar -- expand every initial non-terminal symbol on a RHS so that the RHS of every rule begins with a terminal.

    When you do that you get

    A -> m m | m n | m
    B -> m
    

    so now you left-factor the A rules to

    A -> m A'
    A' -> m | n | ε