Search code examples
parsingprogramming-languagesgrammarcontext-free-grammarcontext-sensitive-grammar

Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?


According to the ECMAScript spec:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar.

Two such symbols are InputElementDiv and InputElementRegExp.

In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can either be a division operator, the start of a regex literal or a comment delimiter. The lexer cannot distinguish between a division operator and regex literal on its own, so it must rely on context information from the parser.

I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.

Questions

  • Why not just use a single goal symbol like so:
InputElement ::
     [...]
     DivPunctuator
     RegularExpressionLiteral
     [...]

and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?

  • What are some other languages that use multiple goal symbols in their lexical grammar?

  • How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the LHS of its productions are not surrounded by a context of terminal and nonterminal symbols).


Solution

  • Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).

    However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.

    As you note, context-sensitivity is usually abstracted in a grammar by a grammar with a pattern on the left-hand side rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.

    In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.

    In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.

    If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the token non-terminals would include InputElementDiv and while in other productions InputElementRegExp would be the acceptable non-terminal. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback" and sometimes by terms which are rather less value neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.

    Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.

    Notes

    1. The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in

      a /=4/gi;
      

      is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), then the / and the = would be separate tokens, and so would g and i.

    2. Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.

    3. There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly calle "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.