Search code examples
haskellparsecmegaparsec

parsec how to recursively parse simple expression?


I have a strings like that: ***, **(*)*, ****(**(**)*)**

And I want to parse it in the data structure like that:

data Tree = Node [S] Tree Tree | Empty where S is * (* doesn't mean any char it is just star symbol)

I tried to build parser (I use megaparsec but it is very similar to habitual parsec):

data Tree = Node [Char] Tree Tree | Empty deriving (Show)

chain :: Parser Tree 
chain = do
    line <- many $ char '*'
    branch <- between (char '(') (char ')') chain
    cont <- (eof >> return Empty) <|> chain
    return $ Node line branch cont

test = parseTest chain "****(*(*)***)*(*)**"

But It doesn't work. I tried many ways but can't struggle with it.


Solution

  • Let's start with a simpler test case:

    > parseTest chain "*"
    parse error at (line 1, column 2):
    unexpected end of input
    expecting "*" or "("
    

    Note that there's a parse error after reading the first star. The input ended, but the parser expected to read either another star or an open parenthesis.

    Looking at your parser, it's clear that:

    line <- many $ char '*'
    

    succeeds, by reading the first string of stars, but the next line:

    branch <- between (char '(') (char ')') chain
    

    requires an open parenthesis in the input, and this is not made optional in any way.

    We could fix this by writing:

    branch <- option Empty $ between (char '(') (char ')') chain
    

    Now, the parser works okay on "***", but it hangs on "**(*)*". The problem is the line:

    cont <- (eof >> return Empty) <|> chain
    

    This tries to decide when to stop parsing based on detecting the end of input, but this only works at the top level chain call where the end of the current tree corresponds to the end of the input -- in a recursive call, the tree can end before the input does, so this won't work.

    Specifically, in the test case "**(*)*", when parsing the tree inside the parentheses, namely *, we get line set to *, branch set to Empty, and then the cont line sees that we aren't at the end of the input (since the rest of the input ")*" is still to be read) and recursively calls chain. In this recursive call, line gets set to the empty string, branch gets set to Empty, and the cont line again causes a recursive call to chain, and we have an infinite loop.

    Instead, let's write a parser tree that parses the line of a tree:

    tree = do
      line <- many $ char '*'
    

    and now optionally a tree in parenthesis (for the left hand side):

      mleft  <- optionMaybe $ between (char '(') (char ')') tree
    

    If there is no left-hand side, then there can't be a right hand side either (convince yourself that this is true! -- try to write a tree that has no left hand side in parenthesis but still has a non-empty right hand side, and you'll see it can't be done), so we're finished:

      case mleft of
        Nothing -> return $ Node line Empty Empty
    

    If there is a left-hand side, then read the right hand side tree (which might be empty, but that's okay) and return the node:

        Just left -> do
          right <- tree
          return $ Node line left right
    

    The whole parser looks like:

    tree :: Parser Tree
    tree = do
      line <- many $ char '*'
      mleft  <- optionMaybe $ between (char '(') (char ')') tree
      case mleft of
        Nothing -> return $ Node line Empty Empty
        Just left -> do
          right <- tree
          return $ Node line left right
    

    and hopefully does what you expect:

    > parseTest tree "*"
    Node "*" Empty Empty
    > parseTest tree "***"
    Node "***" Empty Empty
    > parseTest tree "**(*)*"
    Node "**" (Node "*" Empty Empty) (Node "*" Empty Empty)
    > parseTest tree "****(**(**)*)**"
    Node "****" (Node "**" (Node "**" Empty Empty)
        (Node "*" Empty Empty)) (Node "**" Empty Empty)
    

    This parser just ignores trailing input:

    > parseTest tree "*hello*"
    Node "*" Empty Empty
    

    but you can write a wrapper to require the end of the outermost tree correspond to the end of input:

    treeOnly :: Parser Tree
    treeOnly = tree <* eof