flex-lexer, lex, jflex

How to invoke multiple lexical analyzers for the same text?


Let's say we have two code snippets:

long l = 0L;
string sqlStr = "SELECT Column1, Column2 FROM Table ORDER BY Column3 DESC";
int i = 0;
// ... C# code goes on

and

/// <summary>
/// This method does something.
/// </summary>
/// <param name="a">The a parameter.</param>
/// <param name="b">The b parameter.</param>
/// <exception cref="ArgumentNullException">
/// Thrown when <paramref name="b"/> is <c>null</c>.
/// </exception>
public static void DoSomething(int a, string b)
{
    // ... etc
}

And let's say I have written C#, SQL and XML scanners (I haven't, but that's not the point of this question).

When this hypothetical C# scanner encounters the sqlStr string's contents with inline SQL, it has to invoke or switch to the SQL scanner (sounds weird, but bear with me; it has to), analyze the contents as SQL, then switch back and continue scanning C#.

Similarly, when the XML documentation comment on the DoSomething method is found, the C# scanner must switch to the XML scanner and produce XML tokens, then, once that is done, continue as C#.

(Obviously, the consuming application will distinguish between C#, SQL and XML tokens.)

My questions:

  • Is this kind of gymnastics possible using JFlex/Flex/Lex? (i.e., invoking or switching to a different analyzer context)
  • Have you seen anything like this in the wild?

Solution

  • The (f)lex interface does not separate buffer management from lexical analysis, so there is no mechanism to just "call another lexer" in the middle of a scan. [Note 1]

    However, you can easily equip the generated scanner to handle different lexical contexts, using start conditions. Each condition defines a completely independent set of rules, so as long as the scanners can be made to use the same semantic value type (in C, this would involve expanding the semantic union), you can pack any number of different token sets into a single generated scanner, as in the sketch below.
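
    For illustration, here is a minimal flex (C) sketch of that approach (the question also mentions JFlex, but this sketch is plain flex). The token codes, and the simplifying assumption that every C# string literal contains SQL, are hypothetical; a real scanner would share its token definitions with the parser.

%{
/* Minimal sketch: one generated scanner, two token sets, switched with an
   exclusive start condition.  Token codes are hypothetical stand-ins. */
#include <stdio.h>
enum { TOK_CS_IDENT = 258, TOK_SQL_KEYWORD, TOK_SQL_IDENT, TOK_SQL_STRING_END };
%}
%option noyywrap

%x SQL

%%

\"                               { BEGIN(SQL); /* string literal: enter the SQL rule set */ }
[A-Za-z_][A-Za-z0-9_]*           { return TOK_CS_IDENT; }
[ \t\r\n]+                       { /* skip whitespace */ }
.                                { /* other C# tokens elided from this sketch */ }

<SQL>\"                          { BEGIN(INITIAL); return TOK_SQL_STRING_END; }
<SQL>SELECT|FROM|ORDER|BY|DESC   { return TOK_SQL_KEYWORD; }
<SQL>[A-Za-z_][A-Za-z0-9_]*      { return TOK_SQL_IDENT; }
<SQL>[ \t,]+                     { /* skip separators inside the SQL snippet */ }

%%

/* hypothetical driver: dump token codes for whatever arrives on stdin */
int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("%d: %s\n", tok, yytext);
    return 0;
}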

    The tricky part is getting the transitions right, and that would be a little tricky even if the interface were more cooperative. LALR(1) parsers, as generated by common parser generators, rely on reading a lookahead token in advance: when a parser action is executed, the next input token will normally already have been read. So if a parser action invokes the scanner procedure to change state, that state change will often not take effect until the token after the lookahead token.

    But even that cannot be guaranteed, because the parser might choose to execute a reduction action without asking for a lookahead token, in cases where the reduction would be performed regardless of which token follows. Bison implements this optimisation, but not all parser generators do, and not all of them even document whether or not they do it.
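
    To make the timing issue concrete, here is a hypothetical bison fragment that drives the switch from a parser action. The helpers begin_sql() and begin_csharp() are assumed to be defined in the scanner file (wrapping BEGIN(SQL) and BEGIN(INITIAL)), and the comment marks exactly where the lookahead problem can bite.

/* parser.y fragment (sketch only; yylex/yyerror/main and the scanner's
   begin_* helpers would live elsewhere) */
%{
void begin_sql(void);      /* hypothetical: wraps BEGIN(SQL) in the scanner */
void begin_csharp(void);   /* hypothetical: wraps BEGIN(INITIAL)           */
int yylex(void);
void yyerror(const char *msg);
%}

%token TOK_SQL_START TOK_SQL_END TOK_SQL_KEYWORD TOK_SQL_IDENT

%%

embedded_sql
    : TOK_SQL_START
        { begin_sql(); }   /* by the time this mid-rule action runs, the parser
                              may already have fetched its lookahead token
                              using the OLD start condition */
      sql_tokens
      TOK_SQL_END
        { begin_csharp(); }
    ;

sql_tokens
    : /* empty */
    | sql_tokens TOK_SQL_KEYWORD
    | sql_tokens TOK_SQL_IDENT
    ;

%%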

    So, if at all possible, it is useful to make the delimiter strings the same token in both languages. If that cannot be done, it may still be possible to write the grammar in such a way that both tokens have the same effect. There are a variety of other parser- and scanner-specific hacks available, depending on circumstances.

    Except in rare cases, it will not be possible to trigger the transition in a scanner action, because that would require duplicating a significant part of the parse inside the scanner. But when the delimiter strings are particularly identifiable, it may be possible, as in the sketch below.
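
    As a sketch of that rare case, assume (hypothetically) that "///" can only ever introduce an XML documentation comment; then the transition can live entirely in the scanner, with no help from the parser. The XMLDOC condition and the token names mentioned in the comments are invented for the example.

%option noyywrap

%x XMLDOC

%%

"///"                  { BEGIN(XMLDOC); /* unambiguous delimiter: enter XML context */ }

<XMLDOC>\n             { BEGIN(INITIAL); /* doc-comment line ends: back to C# */ }
<XMLDOC>"<"[^>\n]*">"  { /* return a hypothetical TOK_XML_TAG here */ }
<XMLDOC>[^<\n]+        { /* return a hypothetical TOK_XML_TEXT here */ }

[A-Za-z_][A-Za-z0-9_]* { /* C# identifier; the remaining C# rules are elided */ }
[ \t\r\n]+             { /* whitespace */ }
.                      { /* everything else in the C# rule set */ }

%%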

    Examples "in the wild" of all these strategies are available, including in the code of flex itself (which needs to recognize regular expressions in one context and embedded C code in another one).


    Notes:

    1. It is hard to criticise a design decision made so long ago. It is totally understandable. But IMHO, it's really unfortunate that we still have to suffer from it. Control-flow inverted scanners would be useful for a variety of applications, such as online parsing from an input stream inside of an event loop.

    2. With flex, you don't have to make start conditions completely independent. Conditions can be "exclusive" (only patterns specifically marked as part of that condition apply) or non-exclusive (unmarked patterns also apply). Which option you take depends in part on how similar the tokenisations are.
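
       A contrived sketch of the difference: SQL below is exclusive, so the shared rules at the bottom are suspended while it is active, whereas XMLDOC is inclusive, so those shared rules keep applying inside it. The conditions and triggers are invented for the example.

%option noyywrap

/* exclusive: only <SQL>-tagged rules are active while in SQL */
%x SQL
/* inclusive: the untagged rules below stay active while in XMLDOC */
%s XMLDOC

%%

\"                      { BEGIN(SQL); }
"///"                   { BEGIN(XMLDOC); }

<SQL>\"                 { BEGIN(INITIAL); }
<SQL>[^"]+              { /* SQL-only text; the shared rules below are NOT active here */ }

<XMLDOC>\n              { BEGIN(INITIAL); }
<XMLDOC>"<"[^>\n]*">"   { /* XML tag; the shared rules below ARE still active here */ }

[A-Za-z_][A-Za-z0-9_]*  { /* shared identifier rule (applies in INITIAL and XMLDOC) */ }
\n                      { /* newline outside a doc comment */ }
[ \t\r]+                { /* horizontal whitespace */ }
.                       { /* anything else */ }

%%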