Search code examples
parsingyacclalr

Parse a block where each line starts with a specific symbol


I need to parse a block of code which looks like this:

* Block
| Line 1
| Line 2
| ...

It is easy to do:

block : head lines;
head  : '*' line;
lines : lines '|' line
      | '|' line
      ;

Now I wonder, how can I add nested blocks, e.g.:

* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...

Can this be expressed as a LALR grammar?

I can, of course, parse the top-level blocks and than run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic so it's interesting for me to avoid such approach.


Solution

  • The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.

    However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:

    • when the initial sequence of |s is longer than the previous line, insert an BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)

    • when the initial sequence of |s, is shorter then the previous line, enough END_BLOCKs are inserted to bring the expected length to the correct value.

    • The |s themselves are not passed through to the parser.

    This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python an Haskell. The main difference is that here we don't need a stack of indent levels.

    Once that transformation is finished, the grammar will look something like:

    content: /* empty */
           | content line
           | content block
    block  : head BEGIN_BLOCK content END_BLOCK
           | head
    head   : '*' line
    

    A rough outline of a flex implementation would be something like this: (see Note 1, below).

    %x INDENT CONTENT
    %%
      static int depth = 0, new_depth = 0;
      /* Handle pending END_BLOCKs */
      send_end:
        if (new_depth < depth) {
          --depth;
          return END_BLOCK;
      }
    ^"|"[[:blank:]]*   { new_depth = 1; BEGIN(INDENT); }
    ^.                 { new_depth = 0; yyless(0); BEGIN(CONTENT);
                         goto send_end; }
    ^\n                /* Ignore blank lines */
    <INDENT>{
      "|"[[:blank:]]*  ++new_depth;
      .                { yyless(0); BEGIN(CONTENT);
                         if (new_depth > depth) {
                           ++depth;
                           if (new_depth > depth) { /* Report syntax error */ }
                           return BEGIN_BLOCK;
                         } else goto send_end;
                       }
      \n               BEGIN(INITIAL); /* Maybe you care about this blank line? */
    }
      /* Put whatever you use here to lexically scan the lines */
    <CONTENT>{
      \n               BEGIN(INITIAL);
    }
    

    Notes:

    1. Not everyone will be happy with the goto but it saves some code-duplication. The fact that the state variable (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only useful for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.

    2. The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)

      There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).

      In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.