I have a strings like that: ***
, **(*)*
, ****(**(**)*)**
And I want to parse it in the data structure like that:
data Tree = Node [S] Tree Tree | Empty
where S
is *
(*
doesn't mean any char it is just star symbol)
I tried to build parser (I use megaparsec
but it is very similar to habitual parsec
):
data Tree = Node [Char] Tree Tree | Empty deriving (Show)
chain :: Parser Tree
chain = do
line <- many $ char '*'
branch <- between (char '(') (char ')') chain
cont <- (eof >> return Empty) <|> chain
return $ Node line branch cont
test = parseTest chain "****(*(*)***)*(*)**"
But It doesn't work. I tried many ways but can't struggle with it.
Let's start with a simpler test case:
> parseTest chain "*"
parse error at (line 1, column 2):
unexpected end of input
expecting "*" or "("
Note that there's a parse error after reading the first star. The input ended, but the parser expected to read either another star or an open parenthesis.
Looking at your parser, it's clear that:
line <- many $ char '*'
succeeds, by reading the first string of stars, but the next line:
branch <- between (char '(') (char ')') chain
requires an open parenthesis in the input, and this is not made optional in any way.
We could fix this by writing:
branch <- option Empty $ between (char '(') (char ')') chain
Now, the parser works okay on "***"
, but it hangs on "**(*)*"
. The problem is the line:
cont <- (eof >> return Empty) <|> chain
This tries to decide when to stop parsing based on detecting the end of input, but this only works at the top level chain
call where the end of the current tree corresponds to the end of the input -- in a recursive call, the tree can end before the input does, so this won't work.
Specifically, in the test case "**(*)*"
, when parsing the tree inside the parentheses, namely *
, we get line
set to *
, branch
set to Empty
, and then the cont
line sees that we aren't at the end of the input (since the rest of the input ")*"
is still to be read) and recursively calls chain
. In this recursive call, line
gets set to the empty string, branch
gets set to Empty
, and the cont
line again causes a recursive call to chain
, and we have an infinite loop.
Instead, let's write a parser tree
that parses the line
of a tree:
tree = do
line <- many $ char '*'
and now optionally a tree
in parenthesis (for the left hand side):
mleft <- optionMaybe $ between (char '(') (char ')') tree
If there is no left-hand side, then there can't be a right hand side either (convince yourself that this is true! -- try to write a tree that has no left hand side in parenthesis but still has a non-empty right hand side, and you'll see it can't be done), so we're finished:
case mleft of
Nothing -> return $ Node line Empty Empty
If there is a left-hand side, then read the right hand side tree (which might be empty, but that's okay) and return the node:
Just left -> do
right <- tree
return $ Node line left right
The whole parser looks like:
tree :: Parser Tree
tree = do
line <- many $ char '*'
mleft <- optionMaybe $ between (char '(') (char ')') tree
case mleft of
Nothing -> return $ Node line Empty Empty
Just left -> do
right <- tree
return $ Node line left right
and hopefully does what you expect:
> parseTest tree "*"
Node "*" Empty Empty
> parseTest tree "***"
Node "***" Empty Empty
> parseTest tree "**(*)*"
Node "**" (Node "*" Empty Empty) (Node "*" Empty Empty)
> parseTest tree "****(**(**)*)**"
Node "****" (Node "**" (Node "**" Empty Empty)
(Node "*" Empty Empty)) (Node "**" Empty Empty)
This parser just ignores trailing input:
> parseTest tree "*hello*"
Node "*" Empty Empty
but you can write a wrapper to require the end of the outermost tree correspond to the end of input:
treeOnly :: Parser Tree
treeOnly = tree <* eof