I am trying to parse a very simple language that consists of only decimal or binary numbers. For example, here are some valid inputs:
#b1
#d1
#b0101
#d1234
I am having a problem using Parsec's choice operator: <|>
. According to the tutorial: Write yourself a Scheme in 48 hours:
[The choice operator] tries the first parser, then if it fails, tries the second. If either succeeds, then it returns the value returned by that parser..
But in my experience, I see that the order of the parsers supplied matters. Here is my program:
import System.Environment
import Text.ParserCombinators.Parsec
main :: IO ()
main = do
(x:_) <- getArgs
putStrLn ( "Hello, " ++ readExp x)
bin :: Parser String
bin = do string "#b"
x <- many( oneOf "01")
return x
dec :: Parser String
dec = do string "#d"
x <- many( oneOf "0123456789")
return x
-- Why does order matter here?
parseExp = (bin <|> dec)
readExp :: String -> String
readExp input = case parse parseExp "test" input of
Left error -> "Error: " ++ show error
Right val -> "Found val" ++ show val
Here is how I am running the program:
$ cabal sandbox init
$ cabal install parsec
$ cabal exec ghc Main
$ ./Main "#d1"
Hello, Error: "test" (line 1, column 1):
unexpected "d"
expecting "#b"
$ ./Main "#b1"
Hello, Found val"1"
If I change the order of the parsers as follows:
parseExp = (dec <|> bin)
then only binary numbers are detected and the program fails to identify the decimal numbers.
With the tests that I have performed, I see this problem only happens when one of the parsers has started parsing an input e.g. if a hash character #
is found, the bin parser is activated ending up in failing as the next character expected is b
and not d
. It seems like there should be some kind of backtracking that should happen, which I am not aware of.
Appreciate the help!
Parsec has two kinds of "failure": there are failures that consume input, and failures that don't. To avoid backtracking (and therefore holding onto inputs longer than necessary/being generally unfriendly to the garbage collector), (<|>)
commits to the first parser as soon as it consumes input; so that if its first argument consumes input and fails, its second parser never gets a chance to succeed. You can explicitly request backtracking behavior with try
, thus:
Text.Parsec> parse (string "ab" <|> string "ac") "" "ac"
Left (line 1, column 1):
unexpected "c"
expecting "ab"
Text.Parsec> parse (try (string "ab") <|> string "ac") "" "ac"
Right "ac"
Unfortunately, try
has some pretty annoying performance penalties, which means that if you want a performant parser, you will have to refactor your grammar a bit. I would do that with the above parser this way:
Text.Parsec> parse (char 'a' >> (("ab" <$ char 'b') <|> ("ac" <$ char 'c'))) "" "ac"
Right "ac"
In your case, you will need to factor out the "#" mark similarly.