Search code examples
haskellindentationparsec

Make Text.Parsec.Indent parsing fail if unindent does not match any outer indentation level


I am trying to parse a string in Haskell which represents a tree. Each node is on a line, where the indentation determines the nesting (e.g. like the syntax of Python or Haskell).

A successful parsing result should be of the type [Tree] (a forest of rose trees), where

data Tree = Tree String [Tree]
          deriving Show

For example, the string

A
    B
    C
D
    E
        F

should result in

[Tree "A" [Tree "B" [], Tree "C" []], Tree "D" [Tree "E" [Tree "F" []]]]

With the Haskell indents package and the help of an article about how to parse indented trees, I could parse examples like this.

The problem is that my current implementation successfully parses the string

A
    B
        C
      D

as

[Tree "A" [Tree "B" [Tree "C" []]], Tree "D" []]

The node D is parsed as another root. However, I would like to make the parsing fail, since D is unindented relative to C, but it does not match the indentation level of A or B.

My implementation is as follows (simplifications: tree labels are nonempty and have alphanumeric characters only; not every kind of Unicode newline is supported).

import qualified Control.Applicative as A
import qualified Control.Monad as M
import qualified Text.Parsec as P
import qualified Text.Parsec.Indent as I

parse :: String -> Either P.ParseError [Tree]
parse = I.runIndent "" . P.runParserT forest () ""

forest = P.many tree A.<* P.eof

tree = spacing A.*> I.withBlock Tree node tree

spacing = P.many P.newline A.*> indentation

indentation = P.many $ P.char ' '

node = label A.<* spacing

label = P.many1 P.alphaNum A.<* lineEnd

lineEnd = M.void P.newline P.<|> P.eof

How can this parser be changed to accept only unindents which match some outer indentation level? Thanks in advance.


Solution

  • Simply replacing the definition of forest by

    forest = I.block tree A.<* P.eof
    

    seems to do the trick. I.block here makes sure that the roots of the source string are indented to the same level.

    The example

    A
        B
            C
          D
    

    now fails with

    (line 4, column 7):
    unexpected 'D'
    expecting " " or end of input
    not indented or indentation doesn't match
    

    The problem with P.many tree is that a tree can be arbitrarily indented. As soon as an unindentation of a tree does not match any outer indentation level (D in the example), the parser happily grows another top-level tree (root) of the forest instead of failing.