Search code examples
parsingsyntaxlanguage-designtreetoppeg

PEG for Python style indentation


How would you write a Parsing Expression Grammar in any of the following Parser Generators (PEG.js, Citrus, Treetop) which can handle Python/Haskell/CoffeScript style indentation:

Examples of a not-yet-existing programming language:

square x =
    x * x

cube x =
    x * square x

fib n =
  if n <= 1
    0
  else
    fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets

Update: Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:

foo
  bar = 1
  baz = 2
tap
  zap = 3

# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }

Solution

  • Pure PEG cannot parse indentation.

    But peg.js can.

    I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.

    For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js

    /* Initializations */
    {
      function start(first, tail) {
        var done = [first[1]];
        for (var i = 0; i < tail.length; i++) {
          done = done.concat(tail[i][1][0])
          done.push(tail[i][1][1]);
        }
        return done;
      }
    
      var depths = [0];
    
      function indent(s) {
        var depth = s.length;
    
        if (depth == depths[0]) return [];
    
        if (depth > depths[0]) {
          depths.unshift(depth);
          return ["INDENT"];
        }
    
        var dents = [];
        while (depth < depths[0]) {
          depths.shift();
          dents.push("DEDENT");
        }
    
        if (depth != depths[0]) dents.push("BADDENT");
    
        return dents;
      }
    }
    
    /* The real grammar */
    start   = first:line tail:(newline line)* newline? { return start(first, tail) }
    line    = depth:indent s:text                      { return [depth, s] }
    indent  = s:" "*                                   { return indent(s) }
    text    = c:[^\n]*                                 { return c.join("") }
    newline = "\n"                                     {}
    

    depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.

    peg.js produces for the text:

    alpha
      beta
      gamma
        delta
    epsilon
        zeta
      eta
    theta
      iota
    

    these results:

    [
       "alpha",
       "INDENT",
       "beta",
       "gamma",
       "INDENT",
       "delta",
       "DEDENT",
       "DEDENT",
       "epsilon",
       "INDENT",
       "zeta",
       "DEDENT",
       "BADDENT",
       "eta",
       "theta",
       "INDENT",
       "iota",
       "DEDENT",
       "",
       ""
    ]
    

    This tokenizer even catches bad indents.