Search code examples
regexparsinggrammardfaparser-generator

Convert a dfa to rule for a asterisk case


Here is a simple but very common grammar rule case in EBNF format, the Statements is a none terminal symbol and Statement is none terminal symbol:

Statements ::= (Statement ';')*

After converting this rule to NFA, and then do the subset contruction for converting the NFA to DFA, and at last get the dfa:

State0 -> Statement -> State1 -> ';' ->State0
State0 -> ε -> State0

The State0 is the DFA's start state representing the none terminal symbol Statements, also it is the finish state. From State0 input Statement and traslate to State1 and input ';' at State1, translate to State0. Also, State0 could translate to self with the ε.

And after converting the above dfa to regular grammar following the algorithm in dragon book, i get the following grammar rules:

Statements -> ε
Statements -> Statement Extend_NT
Extend_NT  -> ';' Statements

It added the new none terminal symbol Extend_NT, but i want to get the following the regular grammars which does not contain the extend symbol Extend_NT:

Statements -> ε
Statements -> Statement ';' Statements

So the question is that is there any algorithm could get the above result that does not contain the new none terminal symbol Extend_NT?

Or it is just a engineering problem?


Solution

  • I'm not really sure I understand your question.

    If you just have a single production for a non-terminal and that production just has a single repetition operator, either at the beginning or the end, you can apply a simple desugaring: (Here α and β are sequences of grammar symbols (but not EBNF symbols), and α might be empty.)

    N ::= α (β)* ⇒  N → α | N β
    N ::= (β)* α ⇒  N → α | β N
    

    If α is empty, you could use either of the above. For an LALR(1) parser generator, the left-recursive version would be the usual choice; if you're building an LL parser, you will of course want the right-recursive version.

    If there is more than one EBNF operator in the right-hand side, then you will need extra non-terminals, one for each EBNF repetition operator, except possibly one.

    I don't know whether you'd call that an algorithm. I personally think of it as engineering, but the difference is not absolute.