Search code examples
haskellparsec

Haskell Html Parser using Parsec


I'm currently writing a html parser in haskell. I'm using the parsec library. At this time the parser just considers standard elements with an opening and closing tag and no attributes. The code looks like this:

data Html = Element String [Html] | Content String
    deriving Show

element :: Parser Html
element = do
    name <- char '<' *> many1 letter <* char '>'
    children <- many $ (try element) <|> content
    string "</" >> string name >> char '>'
    return $ Element name children

content :: Parser Html
content = fmap Content $ many1 $ satisfy (\x -> x /='<')

If I'm using alphabetic and numeric characters for the content, all is working fine. But if I'm using the "less"-sign (<) I just get bad results. Thats why I excluded the "less"-sign for now. Does anybody have an idea how I can fix this behaviour? I tried different things but I just can't get it working.

Thanks and regards Philipp


Solution

  • Technically, something like <div>12 < 8</div> is invalid HTML. It should be written <div>12 &lt; 8</div> instead. (The example mentioned in the comments <div>12 > 8</div> is actually valid HTML, though it's more usual to write it escaped as <div>12 &gt; 8</div>.) However, I gather you aren't interested in writing a perfectly correct HTML parser and would like your parser to accept < characters within content that aren't part of a valid start or end tag.

    So, you'd like to accept each of these examples:

    <div>12 < 8</div>
    <p>x<y</div>
    <pre><<<>>></pre>
    

    but would probably like to reject:

    <p>x<y>z</p>
    

    on the basis that <y> is a valid start tag, but it's missing the matching </y> and also reject:

    <div>x</dvi>
    

    on the basis that </dvi> is an end tag that doesn't match the current active start tag.

    I would start by writing separate parsers for start and end tags:

    startTag :: Parser String
    startTag = char '<' *> many1 letter <* char '>'
    
    endTag :: Parser String
    endTag = string "</" *> many1 letter <* char '>'
    

    Then, write a parser for a String of text content. This can be tricky. Here's a straightforward implementation, even though it's not ideal performance-wise:

    contentString :: Parser String
    contentString = do
      -- fail if start or end tag (so caller will handle them)
      notFollowedBy startTag
      notFollowedBy endTag
      -- otherwise, parse either '<' as content, or some non-empty '<'-free text
      txt <- string "<" <|> many1 (noneOf "<")
      -- and possibly more text
      rest <- contentString <|> pure ""
      return (txt ++ rest)
    

    Note how we first ensure we aren't looking at a < that's part of a valid (start or end) tag. Then, we allow ourselves to parse a single non-tag < as content, or else some entirely <-free content, before looping to parse more. This is a tough function to get right, so testing is key. (It took me two or three tries before I got something that handled all my test cases.)

    Now, we can rewrite element to use the startTag and endTag parsers like so:

    element :: Parser Html
    element = do
        name <- startTag
        children <- many $ try element <|> Content <$> contentString
        name' <- endTag
        when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
        return $ Element name children
    

    and now we get:

    λ> parseTest element "<div>12 < 8</div>"
    Element "div" [Content "12 < 8"]
    λ> parseTest element "<div>x<y</div>"
    Element "div" [Content "x<y"]
    λ> parseTest element "<pre><<<>>></pre>"
    Element "pre" [Content "<<<>>>"]
    λ> parseTest element "<p>x<y>z</p>"
    parse error at (line 1, column 5):
    unexpected "y"
    expecting "</"
    λ> parseTest element "<div>x</dvi>"
    parse error at (line 1, column 13):
    unexpected </dvi>, expected </div>
    

    We can slightly improve error reporting in the fourth test case by fiddling with the trys in element:

    element :: Parser Html
    element = do
        -- add "try" here
        name <- try startTag
        -- remove "try" here
        children <- many $ element <|> Content <$> contentString
        name' <- endTag
        when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
        return $ Element name children
    

    which gives:

    λ> parseTest element "<p>x<y>z</p>"
    parse error at (line 1, column 13):
    unexpected </p>, expected </y>
    

    There's probably some more testing to do, but it seems to work okay on the above test cases, plus a couple more as given below. The full code:

    import Text.Parsec
    import Text.Parsec.String
    import Control.Monad
    
    data Html = Element String [Html] | Content String
        deriving Show
    
    startTag :: Parser String
    startTag = char '<' *> many1 letter <* char '>'
    
    endTag :: Parser String
    endTag = string "</" *> many1 letter <* char '>'
    
    element :: Parser Html
    element = do
      name <- try startTag
      children <- many $ element <|> Content <$> contentString
      name' <- endTag
      when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
      return $ Element name children
    
    contentString :: Parser String
    contentString = do
      -- fail if start or end tag (so caller will handle them)
      notFollowedBy startTag
      notFollowedBy endTag
      -- otherwise, parse either '<' as content, or some non-empty text
      txt <- string "<" <|> many1 (noneOf "<")
      -- and possibly more text
      rest <- contentString <|> pure ""
      return (txt ++ rest)
    
    main = do
      mapM_ (parseTest element)
        [ "<div>12 < 8</div>"
        , "<div>x<y</div>"
        , "<pre><<<>>></pre>"
        , "<p>x<y>z</p>"
        , "<div>x</dvi>"
        , "<table><tr><td>1</td><td>2</td></tr></table>"
        , "<empty></empty>"
        ]