I am trying to wrap my head around writing parser using parsec in Haskell, in particular how backtracking works.
Take the following simple parser:
import Text.Parsec
type Parser = Parsec String () String
parseConst :: Parser
parseConst = do {
x <- many digit;
return $ read x
}
parseAdd :: Parser
parseAdd = do {
l <- parseExp;
char '+';
r <- parseExp;
return $ l <> "+" <> r
}
parseExp :: Parser
parseExp = try parseConst <|> parseAdd
pp :: Parser
pp = parseExp <* eof
test = parse pp "" "1+1"
test
has value
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input
In my mind this should succeed since I used the try
combinator on parseConst
in the definition of parseExp
.
What am I missing? I am also interrested in pointers for how to debug this in my own, I tried using parserTraced
which just allowed me to conclude that it indeed wasn't backtracking.
PS. I know this is an awful way to write an expression parser, but I'd like to understand why it doesn't work.
There are a lot of problems here.
First, parseConst
can never work right. The type says it must produce a String
, so read :: String -> String
. That particular Read
instance requires the input be a quoted string, so being passed 0 or more digit characters to read
is always going to result in a call to error
if you try to evaluate the value it produces.
Second, parseConst
can succeed on matching zero characters. I think you probably wanted some
instead of many
. That will make it actually fail if it encounters input that doesn't start with a digit.
Third, (<|>)
doesn't do what you think. You might think that (a <* c) <|> (b <* c)
is interchangeable with (a <|> b) <* c
, but it isn't. There is no way to throw try
in and make it the same, either. The problem is that (<|>)
commits to whichever branch succeed, if one does. In (a <|> b) <* c
, if a
matches, there's no way later to backtrack and try b
there. Doesn't matter how you lob try
around, it can't undo the fact that (<|>)
committed to a
. In contrast, (a <* c) <|> (b <* c)
doesn't commit until both a
and c
or b
and c
match the input.
This is the situation you're encountering. You have (try parseConst <|> parseAdd) <* eof
, after a bit of inlining. Since parseConst
will always succeed (see the second issue), parseAdd
will never get tried, even if the eof
fails. So after parseConst
consumes zero or more leading digits, the parse will fail unless that's the end of the input. Working around this essentially requires carefully planning your grammar such that any use of (<|>)
is safe to commit locally. That is, the contents of each branch must not overlap in a way that is disambiguated only by later portions of the grammar.
Note that this unpleasant behavior with (<|>)
is how the parsec family of libraries work, but not how all parser libraries in Haskell work. Other libraries work without the left bias or commit behavior the parsec family have chosen.