A grammar is regular if it is either right-linear or left-linear. This tutorial claims that because of that it has the special property:
A regular grammar has a special property: by substituting every nonterminal (except the root one) with its righthand side, you can reduce it down to a single production for the root, with only terminals and operators on the right-hand side... The reduced expression of terminals and operators can be written in an even more compact form, called a regular expression
So I've decided to test that idea and convert the regular EcmaScript grammar for IdentifierName into regular expressions:
IdentifierName ::
IdentifierStart
IdentifierName IdentifierPart
Suppose IdentifierStart
and IdentifierPart
are limited to the following:
IdentifierStart :: IdentifierPart ::
A A
B C
C &
$
_
But I'm not sure how to proceed since the grammar for IdentifierName
has both recursion and alternation. Any help?
I'm more interested in the approach rather than in finding the resulting regexp which as @Bergi showed is [ABC$_][AC&]*
.
That tutorial is using some non-standard (and surprisingly implicit) definitions.
First of all they use repetition operators in their grammar as they might be found in regular expressions or EBNF. Then they implicitly define a regular grammar to be one that only uses those repetition operators and no recursion. Given that, it's trivial to turn a "regular grammar" into a regex by just inlining all the non-terminals. But by that definition the JS spec's grammar for identifiers is not regular because it contains recursion. So before you could inline everything, you'd first need to replace the recursion with repetition operators.
However this isn't the standard definition of what a regular grammar is. The standard definition is as you said: a grammar is regular if it's either left-linear or right-linear - that is if only the left-most item of a production is a non-terminal or if only the right-most one is. Repetition operators don't exist in the usual definition of a formal grammar.
Now these regular grammars can also be converted to regular expressions, but not by merely applying the method described in the tutorial. One way would be to convert the grammar to a finite automaton and then apply the algorithm described in this answer for example.
However in practice, when doing the conversion by hand (rather than writing a program to do it) the easiest and most common way to perform the conversion is to think about what language the grammar describes (in this case "the language of all words that start with an IdentifierStart symbol and then contain 0 or more IdentifierPart symbols") and then come up with a regular expression that expresses that language (a.k.a. the "look really hard at the problem until you see the solution"-algorithm).