I have worked my way through the Haskell Koans provided here: https://github.com/roman/HaskellKoans
I am stuck on the last two Koans, both involving parsing custom algebraic data types. Here is the first:
data Atom = AInt Int | ASym Text deriving (Eq, Show)

testAtomParser :: Test
testAtomParser = testCase "atom parser" $ do
  -- Change parser with the correct parser to use
  --
  let parser = <PARSER HERE> :: P.Parser Atom
  assertParse (ASym "ab") $ P.parseOnly parser "ab"
  assertParse (ASym "a/b") $ P.parseOnly parser "a/b"
  assertParse (ASym "a/b") $ P.parseOnly parser "a/b c"
  assertParse (AInt 54321) $ P.parseOnly parser "54321"
How can I define the variable parser so that it parses the algebraic data type Atom and passes the assertions?
Parsers of an ADT tend to reflect the shape of the ADT. Your ADT is formed of two disjoint parts, so your parser probably has two disjoint parts as well:
atom = _ <|> _
Assuming we know how to parse a single digit (let's call that basic parser digit), we can parse a (non-negative) integer by just repeating it.
natural = let loop = digit >> loop in loop
This consumes digits one after another and throws them away, and because loop never returns a value it cannot succeed on finite input. Can we do better? Not with just a monad instance, unfortunately. We need another basic combinator, many, which modifies some other parser to consume input 0 or more times, accumulating the results into a list. We'll actually adjust this slightly, since an empty parse isn't a valid number
many1 p = do x <- p
             xs <- many p
             return (x:xs)
natural' = many1 digit
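As a rough illustration of the many1 step, here is a minimal sketch using attoparsec's own digit and many1. It assumes the attoparsec and text packages are available; the names naturalDigits, exampleOk and exampleFail are made up for this example and are not part of the koan.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Attoparsec.Text as P

-- One or more decimal digits, collected into a String.
-- (naturalDigits is a hypothetical name for this sketch.)
naturalDigits :: P.Parser String
naturalDigits = P.many1 P.digit

-- parseOnly does not require the whole input to be consumed, so the
-- trailing " rest" is simply left unread.
exampleOk :: Either String String
exampleOk = P.parseOnly naturalDigits "54321 rest"

-- many1 needs at least one digit, so input with no leading digit
-- produces a Left with an error message.
exampleFail :: Either String String
exampleFail = P.parseOnly naturalDigits "abc"
```

Loaded in GHCi, exampleOk should evaluate to Right "54321", while exampleFail is a Left.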
What about symbols? To pass the test cases, it appears that a symbol must be one or more alphabetic characters or forward slashes. Again, this disjoint structure can be immediately expressed in our parser
sym = many1 (_ <|> _)
We'll again use some built-in simple parser combinators to build up what we want, say satisfy :: (Char -> Bool) -> Parser Char, which matches any character that satisfies some predicate. We can immediately build another useful combinator, char :: Char -> Parser Char, defined as char c = satisfy (== c), and then we're done.
sym = many1 (char '/' <|> satisfy isAlpha)
where isAlpha is a predicate much like the regex [a-zA-Z].
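The symbol parser can be tried out on its own. The sketch below assumes attoparsec's Data.Attoparsec.Text module and Data.Char.isAlpha from base; symChars and the two example values are hypothetical names chosen for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import qualified Data.Attoparsec.Text as P

-- One or more characters, each either a '/' or a letter.
-- Note: Data.Char.isAlpha accepts any Unicode letter, so it is
-- somewhat broader than the regex [a-zA-Z].
symChars :: P.Parser String
symChars = P.many1 (P.char '/' <|> P.satisfy isAlpha)

-- Parsing stops at the first character that is neither a letter nor '/'.
symExample :: Either String String
symExample = P.parseOnly symChars "a/b c"

-- Digits are not accepted, so a purely numeric input is rejected.
symOnDigits :: Either String String
symOnDigits = P.parseOnly symChars "54321"
```

Here symExample should be Right "a/b", matching the third koan assertion, and symOnDigits is a Left, which is what lets the integer branch and the symbol branch stay disjoint.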
So now we have the core of our parser
natural' <|> sym :: Parser String
The many1 combinators lift our character parsers into parsers of lists of characters (Strings!). This lifting action is the basic idea for building ADT parsers, too. We want to lift our Parser String up into Parser Atom. One way to do it would be to use a function toAtom :: String -> Atom which we could then fmap into the Parser
atom' :: Parser Atom
atom' = fmap toAtom (natural' <|> sym)
but a function with type String -> Atom defeats the purpose of building a parser in the first place: it would have to re-inspect the string to decide between AInt and ASym, which is exactly the work the parser has already done.
As noted at the start, the important part is that the shape of the ADT is reflected in the shape of our atom parser, and we'll need to take advantage of that structure to build our final parser. Let's instead build two functions
liftInt :: String -> Atom -- creates `AInt`s
liftSym :: String -> Atom -- creates `ASym`s
liftInt = AInt . read
liftSym = ASym
each of which states both a method of turning Strings into Atoms and what kind of Atom we're dealing with. It's worth noting that liftInt will throw a runtime error if we pass it a string that cannot be parsed into an Int. Fortunately, a string that can be parsed is exactly what we know we have.
atomInt :: Parser Atom
atomInt = liftInt <$> natural'
atomSym :: Parser Atom
atomSym = liftSym <$> sym
atom'' = atomInt <|> atomSym
Now our atom'' parser takes advantage of the guarantee that natural' will only return strings which are valid parses for a natural (so our call to read will not fail!), and we try to build AInt and ASym in order, one after another, in a disjoint structure just like the structure of our ADT.
The whole shebang is thus
atom''' = AInt . read <$> many1 digit
      <|> ASym <$> many1 (char '/' <|> satisfy isAlpha)
which shows the fun of parser combinators. The whole thing is built from the ground up using tiny, composable, simple pieces. Each one does a very small job, but together they span a large space of parsers.
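Putting the pieces together for the koan itself might look roughly like the sketch below. It assumes the koan's P is Data.Attoparsec.Text and that Text is Data.Text, which the question's types suggest but do not state outright. Because ASym holds a Text there, the symbol branch needs an extra T.pack compared with the String-based version above. The names atomInt, atomSym, atom, koanCases and allPass are illustrative only.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Attoparsec.Text as P

data Atom = AInt Int | ASym Text deriving (Eq, Show)

-- Integers: one or more digits, converted with read. This is safe here
-- because many1 digit only ever yields non-empty strings of digits.
atomInt :: P.Parser Atom
atomInt = AInt . read <$> P.many1 P.digit

-- Symbols: one or more letters or '/', packed into Text for ASym.
atomSym :: P.Parser Atom
atomSym = ASym . T.pack <$> P.many1 (P.char '/' <|> P.satisfy isAlpha)

-- Two disjoint branches, mirroring the two constructors of Atom.
atom :: P.Parser Atom
atom = atomInt <|> atomSym

-- The four inputs from the koan, paired with their expected results.
koanCases :: [(Text, Atom)]
koanCases =
  [ ("ab",    ASym "ab")
  , ("a/b",   ASym "a/b")
  , ("a/b c", ASym "a/b")
  , ("54321", AInt 54321)
  ]

-- True when every koan input parses to its expected Atom.
allPass :: Bool
allPass =
  all (\(input, expected) -> P.parseOnly atom input == Right expected) koanCases
```

Under those assumptions, atom is what would go in place of <PARSER HERE>. attoparsec also exports decimal and takeWhile1, which return a number and a Text directly, so the read and T.pack steps could probably be avoided; the version above stays closer to the derivation in this answer.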
You can also easily augment this grammar with more branches in your ADT, a more thoroughly specified symbol type parser, or failure decorations with <?> so that you have great error messages on failed parses.
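As a small sketch of the <?> idea, each branch can be given a name that attoparsec attaches to the failure it reports. This again assumes Data.Attoparsec.Text; the labels "integer" and "symbol" and the names labelledAtom, goodParse and badParse are arbitrary choices for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Attoparsec.Text ((<?>))
import qualified Data.Attoparsec.Text as P

data Atom = AInt Int | ASym Text deriving (Eq, Show)

-- Same grammar as before, with each branch named via <?>.
-- The parentheses matter: <?> binds more loosely than <|>.
labelledAtom :: P.Parser Atom
labelledAtom =
      (AInt . read <$> P.many1 P.digit <?> "integer")
  <|> (ASym . T.pack <$> P.many1 (P.char '/' <|> P.satisfy isAlpha) <?> "symbol")

-- A successful parse is unaffected by the labels.
goodParse :: Either String Atom
goodParse = P.parseOnly labelledAtom "54321"

-- Neither branch matches "!?", so this is a Left whose message
-- carries label context from the failing branch.
badParse :: Either String Atom
badParse = P.parseOnly labelledAtom "!?"
```

Printing badParse in GHCi shows the resulting message; the exact wording depends on the attoparsec version, so it is not reproduced here.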