Search code examples
javascriptparsingparser-combinators

Parsing an indentation based language using parsimmon library


My question is inspired by this one, but for javascript, using the parsimmon parser-combinator library. I want to parse an indentation-sensitive language, like python or yaml.

I've managed to convert the scala example in that answer to javascript easily enough - the key is the chain function in parsimmon, which is equivalent to the >> operator in scala's parser combinators - they both take a parser and a function that returns a parser, and the result of the first parser is passed to the function to choose the next parser.

However, I can't quite wrap my head around how to make this recursive. The example is for a single block - I don't see how to create nested blocks, tracking de-dent levels as needed to parse something like python.


Solution

  • Well, here's one way to do it. It's probably not the best, and it's definitely not intuitive (I'm not sure I understand why it works, and I wrote it :) but it seems to be pretty robust.

    Basically it says: a tree is a line optionally followed by a block. A block, in turn, is an indented list of trees.

    indent is a function that takes the current indentation level, and returns a parser that expects a line indented more than that. The returned parser returns a stack of previous indentation levels.

    I said before it's robust - in fact, it's too robust. It accepts input that really ought to throw an error instead: if you un-indent to an indentation level that doesn't match a previous level, it basically "rounds up" to the next indentation level. I'm not sure where the logic to fix that should go - mutual recursion mixed with parser "chain"-ing is hard to follow!

    var {regex, seq} = Parsimmon;
    
    function tree(state) {
        return seq(
            line,
            block(state).atMost(1).map(x => x[0]? x[0] : [])
        );
    }
    
    function block(state) { 
        return indent(state)
            .chain(tree).atLeast(1);
    }
    
    function indent(state) {
        return regex(/\s/).atLeast(state + 1)
            .map(x => x.length)
            .desc('indent');
    }
    
    let item = regex(/[^\s].*/).desc('item');
    let line = item.skip(regex(/\n?/));
    let start = block(-1);
    
    let result = start.parse('top item 1\n  sub item 1\n  sub item 2\n' + 
        '    even deeper\n  sub item 3\ntop item 2');
    console.log(JSON.stringify(result['value'], null, 2));
    <script src="https://cdn.rawgit.com/jneen/parsimmon/master/src/parsimmon.js"></script>