Haskell Html Parser using Parsec

I'm currently writing a html parser in haskell. I'm using the parsec library. At this time the parser just considers standard elements with an opening and closing tag and no attributes. The code looks like this:

data Html = Element String [Html] | Content String
    deriving Show

element :: Parser Html
element = do
    name <- char '<' *> many1 letter <* char '>'
    children <- many $ (try element) <|> content
    string "</" >> string name >> char '>'
    return $ Element name children

content :: Parser Html
content = fmap Content $ many1 $ satisfy (\x -> x /='<')

If I'm using alphabetic and numeric characters for the content, all is working fine. But if I'm using the "less"-sign (<) I just get bad results. Thats why I excluded the "less"-sign for now. Does anybody have an idea how I can fix this behaviour? I tried different things but I just can't get it working.

Thanks and regards Philipp

Solution

Technically, something like <div>12 < 8</div> is invalid HTML. It should be written <div>12 < 8</div> instead. (The example mentioned in the comments <div>12 > 8</div> is actually valid HTML, though it's more usual to write it escaped as <div>12 > 8</div>.) However, I gather you aren't interested in writing a perfectly correct HTML parser and would like your parser to accept < characters within content that aren't part of a valid start or end tag.

So, you'd like to accept each of these examples:

<div>12 < 8</div>
<p>x<y</div>
<pre><<<>>></pre>

but would probably like to reject:

<p>x<y>z</p>

on the basis that <y> is a valid start tag, but it's missing the matching </y> and also reject:

<div>x</dvi>

on the basis that </dvi> is an end tag that doesn't match the current active start tag.

I would start by writing separate parsers for start and end tags:

startTag :: Parser String
startTag = char '<' *> many1 letter <* char '>'

endTag :: Parser String
endTag = string "</" *> many1 letter <* char '>'

Then, write a parser for a String of text content. This can be tricky. Here's a straightforward implementation, even though it's not ideal performance-wise:

contentString :: Parser String
contentString = do
  -- fail if start or end tag (so caller will handle them)
  notFollowedBy startTag
  notFollowedBy endTag
  -- otherwise, parse either '<' as content, or some non-empty '<'-free text
  txt <- string "<" <|> many1 (noneOf "<")
  -- and possibly more text
  rest <- contentString <|> pure ""
  return (txt ++ rest)

Note how we first ensure we aren't looking at a < that's part of a valid (start or end) tag. Then, we allow ourselves to parse a single non-tag < as content, or else some entirely <-free content, before looping to parse more. This is a tough function to get right, so testing is key. (It took me two or three tries before I got something that handled all my test cases.)

Now, we can rewrite element to use the startTag and endTag parsers like so:

element :: Parser Html
element = do
    name <- startTag
    children <- many $ try element <|> Content <$> contentString
    name' <- endTag
    when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
    return $ Element name children

and now we get:

λ> parseTest element "<div>12 < 8</div>"
Element "div" [Content "12 < 8"]
λ> parseTest element "<div>x<y</div>"
Element "div" [Content "x<y"]
λ> parseTest element "<pre><<<>>></pre>"
Element "pre" [Content "<<<>>>"]
λ> parseTest element "<p>x<y>z</p>"
parse error at (line 1, column 5):
unexpected "y"
expecting "</"
λ> parseTest element "<div>x</dvi>"
parse error at (line 1, column 13):
unexpected </dvi>, expected </div>

We can slightly improve error reporting in the fourth test case by fiddling with the trys in element:

element :: Parser Html
element = do
    -- add "try" here
    name <- try startTag
    -- remove "try" here
    children <- many $ element <|> Content <$> contentString
    name' <- endTag
    when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
    return $ Element name children

which gives:

λ> parseTest element "<p>x<y>z</p>"
parse error at (line 1, column 13):
unexpected </p>, expected </y>

There's probably some more testing to do, but it seems to work okay on the above test cases, plus a couple more as given below. The full code:

import Text.Parsec
import Text.Parsec.String
import Control.Monad

data Html = Element String [Html] | Content String
    deriving Show

startTag :: Parser String
startTag = char '<' *> many1 letter <* char '>'

endTag :: Parser String
endTag = string "</" *> many1 letter <* char '>'

element :: Parser Html
element = do
  name <- try startTag
  children <- many $ element <|> Content <$> contentString
  name' <- endTag
  when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
  return $ Element name children

contentString :: Parser String
contentString = do
  -- fail if start or end tag (so caller will handle them)
  notFollowedBy startTag
  notFollowedBy endTag
  -- otherwise, parse either '<' as content, or some non-empty text
  txt <- string "<" <|> many1 (noneOf "<")
  -- and possibly more text
  rest <- contentString <|> pure ""
  return (txt ++ rest)

main = do
  mapM_ (parseTest element)
    [ "<div>12 < 8</div>"
    , "<div>x<y</div>"
    , "<pre><<<>>></pre>"
    , "<p>x<y>z</p>"
    , "<div>x</dvi>"
    , "<table><tr><td>1</td><td>2</td></tr></table>"
    , "<empty></empty>"
    ]