
Is it legitimate for a tokenizer to have a stack?


I have designed a new language for which I want to write a reasonable lexer and parser. For the sake of brevity, I have reduced this language to a minimum so that my questions are still open.

The language has implicit and explicit strings, arrays and objects. An implicit string is just a sequence of characters that does not contain <, {, [ or ]. An explicit string looks like <salt<text>salt> where salt is an arbitrary identifier (i.e. [a-zA-Z][a-zA-Z0-9]*) and text is an arbitrary sequence of characters that does not contain the salt.
An array starts with [, followed by objects and/or strings and ends with ]. All characters within an array that don't belong to an array, object or explicit string do belong to an implicit string and the length of each implicit string is maximal and greater than 0.
An object starts with { and ends with } and consists of properties. A property starts with an identifier, followed by a colon, then optional whitespaces and then either an explicit string, array or object.

So the string [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ] represents an array with 6 items: " name:test ", "<html>[]</html>", " ", { name: "test" }, "bla" and " " (the object is notated in json).

As one can see, this language is not context free due to the explicit strings (which I don't want to give up). However, the syntax tree is unambiguous.

So my question is: Is a property a token that may be returned by a tokenizer? Or should the tokenizer return T_identifier, T_colon when it reads an object property?
The real language allows even prefixes in the identifier of a property, e.g. ns/name:<a<test>a> where ns is the prefix for a namespace. Should the tokenizer return T_property_prefix("ns"), T_property_prefix_separator, T_property_name("name"), T_property_colon or just T_property("ns/name") or even T_identifier("ns"), T_slash, T_identifier("name"), T_colon?

If the tokenizer should recognize properties (which would be useful for syntax highlighters), it must have a stack, because name: is not a property if it is in an array. To decide whether bla: in [{foo:{bar:[test:baz]} bla:{}}] is a property or just an implicit string, the tokenizer must track when it enters and leaves an object or array. Thus, the tokenizer would no longer be a finite state machine.

Or does it make sense to have two tokenizers - the first, which separates whitespace from alphanumeric character sequences and special characters like : or [, and the second, which uses the first to build more semantic tokens? The parser could then operate on top of the second tokenizer.

In any case, the tokenizer needs unbounded lookahead to see where an explicit string ends. Or should detecting the end of an explicit string happen inside the parser?

Or should I use a parser generator for my undertaking? Since my language is not context free, I don't think there is an appropriate parser generator.

Thanks in advance for your answers!


Solution

  • flex can be requested to provide a context stack, and many flex scanners use this feature. So, while it may not fit with a purist view of how a scanner scans, it is a perfectly acceptable and supported feature. See this chapter of the flex manual for details on how to have different lexical contexts (called "start conditions"); at the very end is a brief description of the context stack. (Don't miss the sentence which notes that you need %option stack to enable the stack.) [See Note 1]

    Slightly trickier is the requirement to match strings with variable end markers. flex does not have any variable match feature, but it does allow you to read one character at a time from the scanner input, using the function input(). That's sufficient for your language (at least as described).

    Here's a rough outline of a possible scanner:

    %option stack
    %x SC_OBJECT
    
    %%
     /* initial/array context only */
    [^][{<]+ yylval = strdup(yytext); return STRING;
     /* object context only */
    <SC_OBJECT>{
      [}]                     yy_pop_state(); return '}';
      [[:alpha:]][[:alnum:]]* yylval = strdup(yytext); return ID;
      [:/]                    return yytext[0];
      [[:space:]]+            /* Ignore whitespace */
    }
     /* either context */
    <*>{
      [][]     return yytext[0]; /* char class with [] */
      [{]      yy_push_state(SC_OBJECT); return '{';
      "<"[[:alpha:]][[:alnum:]]*"<" {
               /* We need to save a copy of the salt because yytext could
                * be invalidated by input().
                */
               char* salt = strdup(yytext);
               char* saltend = salt + yyleng;
               char* match = salt;
               /* The string accumulator code is *not* intended
                * to be a model for how to write string accumulators.
                */
               yylval = NULL;
               size_t length = 0;
               /* change salt to what we're looking for */
               *salt = *(saltend - 1) = '>';
               while (match != saltend) {
                 int ch = input();
                 if (ch == EOF) {
                   yyerror("Unexpected EOF");
                   /* free the temps and do something */
                 }
                 if (ch == *match) ++match;
                 else if (ch == '>') match = salt + 1;
                 else match = salt;
                    /* Don't do this in real code */
                 yylval = realloc(yylval, ++length);
                 yylval[length - 1] = ch;
               }
               /* Get rid of the terminator */
               yylval[length - yyleng] = 0;
               free(salt);
               return STRING;
             }
      .      yyerror("Invalid character in object");
    }
    

    I didn't test that thoroughly, but here is what it looks like with your example input:

    [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]
    Token: [
    Token: STRING: -- name:test --
    Token: STRING: --<html>[]</html>--
    Token: STRING: -- --
    Token: {
    Token: ID: --name--
    Token: :
    Token: STRING: --test--
    Token: }
    Token: STRING: --bla--
    Token: STRING: -- --
    Token: ]
    

    Notes

    1. In your case, unless you wanted to avoid having a parser, you don't actually need a stack since the only thing that needs to be pushed onto the stack is an object context, and a stack with only one possible value can be replaced with a counter.

  Consequently, you could just remove the %option stack and define a counter at the top of the scanner. Instead of pushing the start condition, you increment the counter and set the start condition; instead of popping, you decrement the counter and reset the start condition if it drops to 0.

      %%
        /* Indented text before the first rule is inserted at the top of yylex */
        int object_count = 0;
      <*>[{]         ++object_count; BEGIN(SC_OBJECT); return '{';
      <SC_OBJECT>[}] { if (!--object_count) BEGIN(INITIAL); return '}'; }
      
    2. Reading the input one character at a time is not the most efficient. Since, in your case, a string terminator must start with >, it would probably be better to define a separate "explicit string" context, in which you recognize [^>]+ and [>]. The second of these would do the character-at-a-time match, as with the above code, but would terminate instead of looping if it found a non-matching character other than >. However, the simple code presented may turn out to be fast enough, and anyway it was just intended to be good enough to do a test run.
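      The "skip to the next >, then compare" idea from Note 2 can be restated on an in-memory buffer rather than flex's streaming input(). A minimal sketch, with hypothetical names (`text`, `term` are not part of the scanner above):

      ```c
      #include <assert.h>
      #include <string.h>

      /* Find the ">salt>" terminator of an explicit string. Since every
       * terminator starts with '>', jump to each '>' with strchr and only
       * then compare the full terminator. Returns a pointer to the start
       * of the terminator, or NULL if the string is unterminated. */
      static const char *find_terminator(const char *text, const char *term) {
          size_t n = strlen(term);
          for (const char *p = text; (p = strchr(p, '>')) != NULL; ++p)
              if (strncmp(p, term, n) == 0)
                  return p;                 /* explicit string ends here */
          return NULL;                      /* caller reports an error */
      }

      int main(void) {
          const char *body = "<html>[]</html>>xml> rest";
          const char *end = find_terminator(body, ">xml>");
          assert(end != NULL && end - body == 15);  /* text is 15 chars */
          assert(find_terminator("no end here", ">a>") == NULL);
          return 0;
      }
      ```

      The same shape carries over to a flex "explicit string" start condition: [^>]+ chunks are appended wholesale, and only a > triggers the terminator comparison.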