Parsing Haskell custom data types

I have worked my way through the Haskell Koans provided here: https://github.com/roman/HaskellKoans

I am stuck on the last two Koans, both involving parsing custom algebraic data types. Here is the first:

data Atom = AInt Int | ASym Text deriving (Eq, Show)

testAtomParser :: Test
testAtomParser = testCase "atom parser" $ do
    -- Change parser with the correct parser to use
    --
    let parser = <PARSER HERE> :: P.Parser Atom
    assertParse (ASym "ab") $ P.parseOnly parser "ab"
    assertParse (ASym "a/b") $ P.parseOnly parser "a/b"
    assertParse (ASym "a/b") $ P.parseOnly parser "a/b c"
    assertParse (AInt 54321) $ P.parseOnly parser "54321"

How can I define the variable parser so that it parses the algebraic data type Atom and passes the assertions?

Solution

I.

Parsers for an ADT tend to reflect the shape of the ADT. Your ADT is formed of two disjoint parts, so your parser will probably have two disjoint parts as well:

atom = _ <|> _

II.

Assuming we know how to parse a single digit (let's call that basic parser digit) then we parse a (non-negative) integer by just repeating it.

natural = let loop = digit >> loop in loop

This consumes digits one after another and throws them away. Worse, it never returns a result: the loop only stops when digit fails, and that failure fails the whole parser. Can we do better? Not with just the Monad instance, unfortunately. We need another basic combinator, many, which runs some other parser zero or more times, accumulating the results into a list. We'll actually adjust this slightly, since an empty parse isn't a valid number:

many1 p = do x  <- p
             xs <- many p
             return (x:xs)

natural' = many1 digit
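As a rough sketch of this step against the library the koans use (assuming attoparsec's Data.Attoparsec.Text, which already exports digit and many1), the digit parser can be tried directly. The names naturalDigits and runNatural are only illustrative and do not appear in the koans.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as P
import Data.Text (Text)

-- One or more decimal digits, collected into a String.
-- `naturalDigits` is an illustrative name, not part of the koans.
naturalDigits :: P.Parser String
naturalDigits = P.many1 P.digit

-- Run the parser on some input. parseOnly does not require the
-- whole input to be consumed, so trailing text is simply ignored.
runNatural :: Text -> Either String String
runNatural = P.parseOnly naturalDigits
```

In GHCi, runNatural "54321" should give Right "54321", while input that does not start with a digit gives a Left carrying attoparsec's error message.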

III.

What about symbols? To pass the test cases, it appears that a symbol must be one or more alphabetic characters or forward slashes. Again, this disjoint structure can be expressed immediately in our parser

sym = many1 (_ <|> _)

We'll again use some built-in simple parser combinators to build up what we want, say satisfy :: (Char -> Bool) -> Parser Char, which matches any character that satisfies some predicate. We can immediately build another useful combinator, char :: Char -> Parser Char, defined as char c = satisfy (== c), and then we're done.

sym = many1 (char '/' <|> satisfy isAlpha)

where isAlpha (from Data.Char) is a predicate much like the regex [a-zA-Z], except that it also accepts non-ASCII Unicode letters.
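Here is a minimal sketch of the symbol parser written against attoparsec's Data.Attoparsec.Text, assuming that is the module the koans import as P. The names symChars and runSym are hypothetical helpers introduced only for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import qualified Data.Attoparsec.Text as P
import Data.Text (Text)

-- One or more characters, each either a forward slash or a letter.
-- `symChars` is an illustrative name, not part of the koans.
symChars :: P.Parser String
symChars = P.many1 (P.char '/' <|> P.satisfy isAlpha)

-- Run the symbol parser; parsing stops at the first character
-- that is neither '/' nor a letter (for example a space).
runSym :: Text -> Either String String
runSym = P.parseOnly symChars
```

With this definition, runSym "a/b c" should give Right "a/b", which matches the third assertion in the koan: the parser stops at the space and parseOnly discards the rest.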

IV.

So now we have the core of our parser

natural' <|> sym :: Parser String

The many1 combinator lifts our character parsers into parsers of lists of characters (Strings!). This lifting action is the basic idea for building ADT parsers, too. We want to lift our Parser String up into Parser Atom. One way to do it would be to use a function toAtom :: String -> Atom, which we could then fmap over the parser

atom' :: Parser Atom
atom' = fmap toAtom (natural' <|> sym)

but a function with type String -> Atom defeats the purpose of building a parser in the first place.

As stated in I. the important part is that the shape of the ADT is reflected in the shape of our atom parser. We'll need to take advantage of that to build our final parser.

V.

We need to take advantage of information in the structure of our atom parser. Let's instead build two functions

liftInt :: String -> Atom  -- creates `AInt`s
liftSym :: String -> Atom  -- creates `ASym`s

liftInt = AInt . read
liftSym = ASym

Each of these both states a way of turning Strings into Atoms and declares what kind of Atom we're dealing with. It's worth noting that liftInt will throw a runtime error if we pass it a string that cannot be read as an Int. Fortunately, we will only ever hand it the output of natural', which is known to be a non-empty string of digits.

atomInt :: Parser Atom
atomInt = liftInt <$> natural'

atomSym :: Parser Atom
atomSym = liftSym <$> sym

atom'' = atomInt <|> atomSym

Now our atom'' parser takes advantage of the guarantee that natural' will only return strings which are valid parses for a natural (so our call to read will not fail!), and we try to build both AInt and ASym in order, trying one after another in a disjoint structure just like the structure of our ADT.

VI.

The whole shebang is thus

atom''' =     AInt . read <$> many1 digit
          <|> ASym <$> many1 (    char '/' 
                              <|> satisfy isAlpha)

which shows the fun of parser combinators. The whole thing is built up from the ground using tiny, composable, simple pieces. Each one does a very tiny job but all together they span a large space of parsers.
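One detail the String-based parser glosses over is that the koan's ASym holds a Text, not a String. The sketch below adapts the same parser to that type by packing the matched characters with Data.Text.pack. It assumes P in the koan refers to Data.Attoparsec.Text, and the name atomText is made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import qualified Data.Attoparsec.Text as P
import Data.Text (Text)
import qualified Data.Text as T

-- The ADT from the question.
data Atom = AInt Int | ASym Text deriving (Eq, Show)

-- Same shape as atom''' above: one branch per constructor.
-- `atomText` is an illustrative name, not part of the koans.
atomText :: P.Parser Atom
atomText =
      -- digits are read into an Int; `read` is safe here because
      -- many1 digit only ever yields a non-empty run of digits
      AInt . read <$> P.many1 P.digit
      -- symbol characters are collected and packed into Text
  <|> ASym . T.pack <$> P.many1 (P.char '/' <|> P.satisfy isAlpha)
```

Substituting atomText for the parser placeholder in the koan should satisfy all four assertions, for example P.parseOnly atomText "a/b c" gives Right (ASym "a/b").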

You can also easily augment this grammar with more branches in your ADT, a more thoroughly specified symbol type parser, or failure decorations with <?> so that you have great error messages on failed parses.
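As one possible illustration of those refinements, the sketch below labels each branch with <?> and swaps in two combinators that attoparsec's Data.Attoparsec.Text provides, decimal and takeWhile1, in place of read and many1. This is a variation on the answer rather than the parser built above, and the names labelledInt, labelledSym and labelledAtom are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import qualified Data.Attoparsec.Text as P
import Data.Text (Text)

-- The ADT from the question.
data Atom = AInt Int | ASym Text deriving (Eq, Show)

-- Integer branch: `decimal` parses an unsigned run of digits
-- straight into a number, so no call to `read` is needed.
labelledInt :: P.Parser Atom
labelledInt = AInt <$> P.decimal P.<?> "integer atom"

-- Symbol branch: `takeWhile1` returns the matched Text directly
-- and fails if not even one character matches.
labelledSym :: P.Parser Atom
labelledSym = ASym <$> P.takeWhile1 isSymChar P.<?> "symbol atom"
  where isSymChar c = c == '/' || isAlpha c

-- Each branch is labelled separately because <?> has the lowest
-- precedence and is non-associative, so it cannot be chained
-- with <|> on a single line without parentheses.
labelledAtom :: P.Parser Atom
labelledAtom = labelledInt <|> labelledSym
```

When a parse fails, the label of the failing branch is included in the Left message returned by parseOnly, which makes it easier to see which alternative was being attempted.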