Tags: lexer, lexical-analysis

What's the specification of a lexer?


I was asked the following question:

Given two token rules:
    A := aaa
    B := aaaa

The output for text 'aaaaaaaaa' (nine a's) is
    B (aaaa); B (aaaa); unknown (a)
Rather than
    A (aaa); A (aaa); A (aaa)

I know this is the result of the maximal munch principle. But how can I properly state the correct behavior of a lexer? Specifically, I was asked: "Why should the output be the first, rather than the second, which seems more reasonable?"

I have looked through a lot of the literature, but it only describes how the maximal munch algorithm works. Even so, it would be helpful if anyone could point me to some references.


Solution

  • That's a good question.

    There is no formal definition of a lexical analyser; it's a pattern, not a standard. And the use of maximal munch is a heuristic, not an absolute rule. But it's a pretty good heuristic: maximal munch is almost always the Right Thing To Do.

    But "almost always" is not "always", and many languages have exceptions to maximal munch.

    One pretty common example, which goes back (at least) to Modula-2, is the range operator .., generally used with integer operands so that 0..99 indicates the range from 0 to 99. (The semantics varies from language to language.) In most languages, maximal munch would interpret 0..99 as two floating point constants (0. and .99) rather than the three tokens making up a range (0, .. and 99). So that has to be handled as an exception to maximal munch.

    C++ has another exception. When digraphs were added to the language, it became possible to write [ as <:; unlike trigraphs, digraphs are just tokens, so maximal munch should apply. However, :: is used in C++ as a namespace separator, and ::name indicates "name in the top-level namespace" (as opposed to namespaces active because of a using declaration). Many already-existing programs included template expansions using this syntax -- Template<::classname> -- and these would have become syntax errors if the <: were suddenly interpreted as the syntactic equivalent of [. So an exception was made to maximal munch for the sequence <::, which must be the two tokens < :: unless the next character is > or :. So we have <::a → < :: a, but <:::a → <: :: a and <::> → <: :> (that is, []).

    And there are other examples.

    (F)lex handles these exceptions by using a slightly more flexible definition of "matching" which still depends on the maximal munch heuristic.

    A (f)lex pattern can contain a single (unparenthesized) / operator. / is usually called the "trailing context" operator, but that's not precise. The / does not change the characters matched; its effect is to exclude the part of the match following the / from the matched token. So a (f)lex scanner for C++ might include the rules:

    <         { return T_LT; }
    <:        { return T_LT_COLON; }
    </::[^:>] { return T_LT; /* only the < is consumed; the rest is rescanned */ }
    

    As a result, the input <::a would match all three patterns. Because of maximal munch, the last rule will win, but the token returned from that match is just < rather than the full four-character match. That means that the next scan will start by rescanning ::a (resulting in the detection of a :: token). On the other hand, the input <:::a will not match the last pattern, so the scanner will have to backtrack to the longest pattern actually matched: <:.

    A similar trick can be used to ensure that 2..3 in a language with ranges (Swift, for example) does not match a floating point pattern:

    [[:digit:]]+"."[[:digit:]]* { ... return T_FLOAT; }
    [[:digit:]]+/".."           { ... return T_INTEGER; }
    

    Again, the longest match for 2..3 comes from the second rule, so maximal munch selects it. But the token returned is just 2; the trailing context is rescanned for the next token.
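
    The same token-versus-context split is easy to emulate outside of (f)lex, which may make the mechanism clearer. In this Python sketch (the rule table and token names are my own, mirroring the rules above), the total match length decides which rule wins, but only the token part is consumed:

    import re

    # Each rule is (name, token pattern, optional trailing-context pattern).
    # The *total* match length decides the winner (maximal munch), but the
    # context part is left in the input to be rescanned.
    RULES = [
        ("T_FLOAT",   re.compile(r"[0-9]+\.[0-9]*"), None),
        ("T_INTEGER", re.compile(r"[0-9]+"),         re.compile(r"\.\.")),
        ("T_INTEGER", re.compile(r"[0-9]+"),         None),
        ("T_RANGE",   re.compile(r"\.\."),           None),
    ]

    def tokenize(text):
        pos = 0
        while pos < len(text):
            best = None  # (total match length, token name, token lexeme)
            for name, tok_pat, ctx_pat in RULES:
                m = tok_pat.match(text, pos)
                if not m:
                    continue
                total = m.end() - pos
                if ctx_pat is not None:
                    c = ctx_pat.match(text, m.end())
                    if not c:
                        continue  # this rule matches only with its context
                    total += c.end() - m.end()
                if best is None or total > best[0]:
                    best = (total, name, m.group())
            if best is None:
                raise ValueError(f"no token at position {pos}")
            yield (best[1], best[2])
            pos += len(best[2])  # consume the token only, not the context

    print(list(tokenize("2..3")))
    # [('T_INTEGER', '2'), ('T_RANGE', '..'), ('T_INTEGER', '3')]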

    However, there is another class of maximal munch exceptions which can only be handled by making lexical analysis dependent on syntactic context. For example, EcmaScript (and other languages, including Awk) allow regular expression literals, which start with a / character. However, / might also be a division operator or the first character of the /= operator. (It might also start a comment, but since a regular expression cannot start with * that case is unproblematic.) The language syntax effectively defines two mutually exclusive contexts:

    • in "operator" context, an infix or postfix operator is possible and an operand such as a variable or literal is not.
    • in "operand" context, an operand or prefix operator is possible, but other operators are not.

    So the parser always knows whether the next token might be a division operator, and when it might be a regular expression. But it doesn't have a good way to communicate that fact to the lexer. (The fact that the parser depends on lookahead tokens complicates the scenario further: for the parser to call the lexical analyser with an indication of whether to use the operator or regular expression rule, the parser would need to know that before it reads the lookahead token.)
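
    One common workaround is to have the scanner itself track the context. The following Python sketch does this for a drastically simplified token set; the patterns and the context-update rule are assumptions for illustration, not EcmaScript's actual lexical grammar:

    import re

    TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|/=?|[-+*()=;])")
    REGEX = re.compile(r"\s*(/(?:[^/\\\n]|\\.)+/)")

    def tokenize(text):
        pos = 0
        operand_context = True  # at the start of input, an operand may appear
        while pos < len(text):
            if operand_context:
                m = REGEX.match(text, pos)
                if m:
                    yield ("REGEX", m.group(1))
                    pos = m.end()
                    operand_context = False
                    continue
            m = TOKEN.match(text, pos)
            if not m:
                break
            lexeme = m.group(1)
            kind = "OPERAND" if (lexeme[0].isalnum() or lexeme[0] == "_") else "OP"
            yield (kind, lexeme)
            pos = m.end()
            # After an operand or ')', an operator is expected; after an
            # operator or '(', an operand (possibly a regex) is expected.
            operand_context = not (kind == "OPERAND" or lexeme == ")")

    print(list(tokenize("a = b / c;")))  # here / is a division operator
    print(list(tokenize("a = /b/;")))    # here /b/ is a regex literal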

    Another apparent set of syntactic exceptions to maximal munch are ambiguous double close brackets, of which probably the most infamous is the C++ template argument. When templates were first added to C++, programmers had to be careful not to close two open template specializations without an intervening space, to avoid >> being recognised as a right-shift operator. That was annoying, and there was no good reason for it; there are very few expressions in which >> could be interpreted in two different ways, and in almost all such expressions, one of the possibilities is pretty contrived. So, to the relief of most C++ programmers (and the annoyance of C++ lexical analyser writers), the standard was eventually changed to allow >> to close two open template specializations instead of forcing it to be interpreted as a syntactically (or semantically) invalid right shift. (Note that this case doesn't have to be handled by the lexer. The lexer can return a single >> token, leaving it to the parser to use that single token to close two open specializations. That complicates the grammar a bit, but not unbearably.)
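
    Here is a sketch of that parser-side treatment in Python; the token-stream interface is invented for illustration, but real parsers do the same thing with their own lookahead machinery:

    # When the parser needs a '>' to close a template argument list and
    # sees '>>', it consumes half the token and leaves a '>' behind.
    class TokenStream:
        def __init__(self, tokens):
            self.tokens = list(tokens)
            self.pos = 0

        def expect_close_angle(self):
            tok = self.tokens[self.pos] if self.pos < len(self.tokens) else None
            if tok == ">":
                self.pos += 1
            elif tok == ">>":
                # Split the right-shift token: consume one '>' and leave a
                # single '>' for the enclosing specialization.
                self.tokens[self.pos] = ">"
            else:
                raise SyntaxError(f"expected '>' but found {tok!r}")

    # vector<vector<int>> lexes with a single trailing '>>' token.
    ts = TokenStream(["vector", "<", "vector", "<", "int", ">>"])
    ts.pos = 5                 # positioned at the '>>'
    ts.expect_close_angle()    # closes the inner vector<int>
    ts.expect_close_angle()    # closes the outer vector<...>
    assert ts.pos == 6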

    So, maximal munch is not quite universal. Moreover, from time to time proposals come up for lexers which can handle ambiguous lexicons by producing multiple possible token streams, much in the same way that GLR/GLL parsers can produce multiple possible parse trees. Indeed, one often-suggested possibility is the "scannerless" parser, in which lexical analysis is simply integrated into the grammar. If a GLR/GLL parser is available, the fact that lexical analysis tends to blow up the need for lookahead is no longer a show-stopper. (Allegedly, scannerless parsing works well with PEG grammars, as well.)

    On the other hand, processing parallel token streams can dramatically increase parsing time, even to the extent of requiring time quadratic in the length of the input (although this shouldn't happen with practical grammars). And I think that leads us back to the example in the original question.

    All of the examples presented above show deviations from maximal munch (to the extent that they are deviations from maximal munch) involving two consecutive tokens, which increases the complexity of lexical analysis but shouldn't have much impact on the computational complexity. But in the grammar suggested in the OP, recognising aaaaaaaaa as starting with an aaa token rather than an aaaa token requires looking ahead more than one token (since both aaaaaaaa and aaaaaaaaaa can start with aaaa, while aaaaaaaaa cannot).
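
    To make the cost concrete, here is a sketch of a tokeniser that does find the all-A splitting, by searching for any rule sequence that covers the whole input (memoised backtracking; a toy, not a practical lexer). The lookahead is unbounded: solve() recurses over the entire remaining input, which is exactly the cost maximal munch avoids.

    from functools import lru_cache

    RULES = [("A", "aaa"), ("B", "aaaa")]

    def tokenize_all(text):
        @lru_cache(maxsize=None)
        def solve(pos):
            if pos == len(text):
                return []  # consumed everything: success
            for name, lexeme in RULES:
                if text.startswith(lexeme, pos):
                    rest = solve(pos + len(lexeme))
                    if rest is not None:
                        return [(name, lexeme)] + rest
            return None  # dead end: no rule sequence covers this suffix

        return solve(0)

    print(tokenize_all("a" * 9))  # [('A', 'aaa'), ('A', 'aaa'), ('A', 'aaa')]
    print(tokenize_all("a" * 8))  # [('B', 'aaaa'), ('B', 'aaaa')]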

    In addition to placing possible additional computational demands on the parser, this non-local disambiguation imposes a cognitive load on the human reader (or writer), who is often best served by a parser whose functioning is easy to predict, even if it sometimes fails to find a potential parse.

    The C example of a+++++b springs to mind; occasionally the question arises as to why this is not parsed as a++ + ++b, which is the only possible meaningful parse. I think the answer here is not so much "because maximal munch", as "because most code readers wouldn't see that parse". One of the great advantages of maximal munch, beyond its computational value, is that it is easy to explain and easy to predict.
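
    The same greedy loop as in the first sketch makes the lexer's choice visible (the patterns are reduced to just the tokens involved):

    import re

    # Maximal munch takes '++' whenever it can, so a+++++b lexes as
    # a ++ ++ + b (a syntax error), never as a ++ + ++ b.
    TOKENS = [re.compile(r"[A-Za-z_]\w*"), re.compile(r"\+\+"), re.compile(r"\+")]

    def tokenize(text):
        pos = 0
        while pos < len(text):
            best = None
            for pattern in TOKENS:
                m = pattern.match(text, pos)
                if m and (best is None or m.end() > best.end()):
                    best = m
            if best is None:
                raise ValueError(f"no token at position {pos}")
            yield best.group()
            pos = best.end()

    print(list(tokenize("a+++++b")))  # ['a', '++', '++', '+', 'b']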