javascript ecmascript-next compiler-construction

How to convert regular grammar with recursion and alternations into regular expression

A grammar is regular if it is either right-linear or left-linear. This tutorial claims that because of that it has the special property:

A regular grammar has a special property: by substituting every nonterminal (except the root one) with its righthand side, you can reduce it down to a single production for the root, with only terminals and operators on the right-hand side... The reduced expression of terminals and operators can be written in an even more compact form, called a regular expression

So I've decided to test that idea and convert the regular EcmaScript grammar for IdentifierName into regular expressions:

IdentifierName ::
    IdentifierStart
    IdentifierName  IdentifierPart

Suppose IdentifierStart and IdentifierPart are limited to the following:

IdentifierStart ::       IdentifierPart ::
    A                        A                 
    B                        C
    C                        &
    $                    
    _

But I'm not sure how to proceed since the grammar for IdentifierName has both recursion and alternation. Any help?

I'm more interested in the approach rather than in finding the resulting regexp which as @Bergi showed is [ABC$_][AC&]*.

Solution

That tutorial is using some non-standard (and surprisingly implicit) definitions.

First of all they use repetition operators in their grammar as they might be found in regular expressions or EBNF. Then they implicitly define a regular grammar to be one that only uses those repetition operators and no recursion. Given that, it's trivial to turn a "regular grammar" into a regex by just inlining all the non-terminals. But by that definition the JS spec's grammar for identifiers is not regular because it contains recursion. So before you could inline everything, you'd first need to replace the recursion with repetition operators.

However this isn't the standard definition of what a regular grammar is. The standard definition is as you said: a grammar is regular if it's either left-linear or right-linear - that is if only the left-most item of a production is a non-terminal or if only the right-most one is. Repetition operators don't exist in the usual definition of a formal grammar.

Now these regular grammars can also be converted to regular expressions, but not by merely applying the method described in the tutorial. One way would be to convert the grammar to a finite automaton and then apply the algorithm described in this answer for example.

However in practice, when doing the conversion by hand (rather than writing a program to do it) the easiest and most common way to perform the conversion is to think about what language the grammar describes (in this case "the language of all words that start with an IdentifierStart symbol and then contain 0 or more IdentifierPart symbols") and then come up with a regular expression that expresses that language (a.k.a. the "look really hard at the problem until you see the solution"-algorithm).