Search code examples
haskellparsec

Parsec fails to parse if characters follow my string


I am trying to write something to parse my Django template, however my parser fails if anything follows an {% endblock %}

Here is what I have so far

import Control.Monad
import Text.ParserCombinators.Parsec


data Piece = StaticPiece String 
           | BlockPiece String [Piece]
           | VarPiece String
  deriving (Show)

noWhitespace = many1 $ oneOf "_" <|> alphaNum

parseBlock = do
  blockName <- between (string "{% block" >> spaces) (spaces >> string "%}") noWhitespace <?> "block tag"
  blockContent <- many (parsePiece (void $ try $ (string "{% endblock %}")))
  return $ BlockPiece blockName blockContent

parseVar = do
  var <- between (string "{{" >> spaces) (spaces >> string "}}") noWhitespace <?> "variable"
  return $ VarPiece var

parseStatic end = do
  s <- manyTill (anyChar) $ end <|> (void $ lookAhead $ try $ parseNonStatic)
  return $ StaticPiece s 

parseNonStatic = try parseBlock <|> parseVar
parsePiece s = try parseNonStatic <|> (parseStatic s)

parsePieces = manyTill (parsePiece eof) eof

main :: IO ()
main = do
  putStrLn "1"
  print $ parse parsePieces "" "Blah blah blah"
  putStrLn "2"
  print $ parse parsePieces "" "{{ some_var }} string {{ other_var }} s"
  putStrLn "3"
  print $ parse parsePieces "" "{% block body %}{% endblock %}"
  putStrLn "4"
  print $ parse parsePieces "" "{% block body %}{{ hello }}{% endblock %}"
  putStrLn "5"
  print $ parse parsePieces "" "{% block body %}{% {% endblock %}"
  putStrLn "6"
  print $ parse parseBlock ""  "{% block body %}{% endblock %} "
  putStrLn "7"
  print $ parse parsePieces "" "{% block body %} {} { {{ }{ {{{}} cool } {% block inner_body %} Hello: {{ hello }}{% endblock %} {% endblock %}"
  putStrLn "8"
  print $ parse parsePieces "" "{% block body %} {} {{ cool }} {% block inner_body %} Hello: {{ hello }}{% endblock %}{% endblock %} ldsakjf"
  print ">>"
  --
  print $ parse parseBlock ""  "{% block body %}{% endblock %} "

I am thinking that somehow instead of looking at the string from beginning to end, it is looking at it from the end somehow. If you look at #7 StaticPiece " " is inside the innermost block when it should be in the body block. Any help would be appreciated.

Edit the above code outputs:

1
Right [StaticPiece "Blah blah blah"]
2
Right [VarPiece "some_var",StaticPiece " string ",VarPiece "other_var",StaticPiece " s"]
3
Right [BlockPiece "body" [StaticPiece ""]]
4
Right [BlockPiece "body" [VarPiece "hello",StaticPiece ""]]
5
Right [BlockPiece "body" [StaticPiece "{% "]]
6
Left (line 1, column 32):
unexpected end of input
expecting "{% endblock %}", block tag or variable
7
Right [BlockPiece "body" [StaticPiece " {} { {{ }{ {{{}} cool } ",BlockPiece "inner_body" [StaticPiece " Hello: ",VarPiece "hello",StaticPiece "",StaticPiece " "]]]
8
Right [StaticPiece "{% block body %} {} ",VarPiece "cool",StaticPiece " {% block inner_body %} Hello: ",VarPiece "hello",StaticPiece "{% endblock %}{% endblock %} ldsakjf"]
">>"
Left (line 1, column 32):
unexpected end of input
expecting "{% endblock %}", block tag or variable

Solution

  • Let's rewrite some of the parsers to make things run smoothly.

    Use manyTill to parse blocks with matching endblock tags

    Firstly, we'll need to use parsers that match {% something or other %}, so let's make that a function:

    tag p = between (string "{%" >> spaces) (spaces >> string "%}") p <?> "tag"
    
    ghci> parse (tag $ string "any parser here") "" "{% any parser here %}"
    Right "any parser here"
    

    Let's use manyTill in parseBlock, to grab the endblock tag. I'm still using try, because tag (string "endblock") can fail having read some input, eg { at the start of a variable or other non-tag.

    parseBlock = do
      blockName <- tag (string "block" >> spaces >> noWhitespace) <?> "block tag"
      blockContent <- manyTill parsePiece (try $ tag $ string "endblock") 
      return $ BlockPiece blockName blockContent
    

    parseStatic mustn't match nothing, and should pause to check for tags/vars

    parseStatic is the source of most of the problems with this parser - it allows anything except a tag or var, which is always problematic - parsers are much better at following rules than being liberal.

    We need to stop parseStatic from just eating the remainder of the input, so that the nonstatic parsers get a chance to try again, so let's make a parser to peek at the next character without using it up in any way. Using a single character like this avoids lots of backtracking, although we'll see later there's some combining to do.

    peekChar = void . try . lookAhead .char 
    

    parseStatic also mustn't match the empty string - parsers that match the empty string aren't allowed to be used with any many combinator, because they would allow infinite parses like [StaticPiece "",StaticPiece "",StaticPiece ""..]. That's why we'll allow any character we like (including {) then as many characters as we like that aren't {. The only thing other than { that can terminate a StaticPiece is the end of the input, which is why eof is allowed here.

    parseStatic = do
      c <- anyChar
      s <- manyTill anyChar (peekChar '{' <|> eof)
      return $ StaticPiece (c:s) 
    
    ghci> parse parseStatic "" "some stuff not containing { other stuff"
    Right (StaticPiece "some stuff not containing ")
    

    So we get

    parsePieces = manyTill parsePiece eof
    

    Glue those statics together

    We now get nice parses like

    ghci> parse parsePieces "" "{{ some_var }} string {{ other_var }} s"
    Right [VarPiece "some_var",StaticPiece " string ",VarPiece "other_var",StaticPiece " s"]
    

    but also uglier ones like

    ghci> parse parsePieces "" "{% block body %} {} { {{ }{ {{{}} cool } {% block inner_body %} Hello: {{ hello }}{% endblock %} {% endblock %}"
    Right [BlockPiece "body" [StaticPiece " ",StaticPiece "{} ",StaticPiece "{ ",StaticPiece "{",StaticPiece "{ }",StaticPiece "{ ",StaticPiece "{",StaticPiece "{",StaticPiece "{}} cool } ",BlockPiece "inner_body" [StaticPiece " Hello: ",VarPiece "hello"],StaticPiece " "]]
    

    because parseStatic stops every time we hit {. Let's roll adjacent statics into one with a few helper functions:

    isStatic :: Piece -> Bool
    isStatic (StaticPiece _) = True
    isStatic _ = False
    
    unStatic :: Piece -> String
    unStatic (StaticPiece s) = s
    unStatic _ = error "unStatic: applied to something other than a StaticPiece"
    

    We'll use span :: (a -> Bool) -> [a] -> ([a], [a]) to collect up the non-statics and concat the statics:

    combineStatics :: [Piece] -> [Piece] 
    combineStatics pieces = let (nonstatics,therest) = span (not.isStatic) pieces in
        nonstatics ++ combine therest where
          combine [] = []
          combine ps = let (statics,more) = span isStatic ps in
            (StaticPiece . concat . map unStatic) statics : combineStatics more
    

    and rewrite parseBlock to combine any statics in its block content:

    parseBlock = do
      blockName <- tag (string "block" >> spaces >> noWhitespace) <?> "block tag"
      blockContent <- manyTill parsePiece (try $ tag $ string "endblock")
      return $ BlockPiece blockName (combineStatics blockContent)
    

    Now it works well

    The tests now run as I imagine you'd hope:

    1
    Right [StaticPiece "Blah blah blah"]
    2
    Right [VarPiece "some_var",StaticPiece " string ",VarPiece "other_var",StaticPiece " s"]
    3
    Right [BlockPiece "body" []]
    4
    Right [BlockPiece "body" [VarPiece "hello"]]
    5
    Right [BlockPiece "body" [StaticPiece "{% "]]
    6
    Right (BlockPiece "body" [])
    7
    Right [BlockPiece "body" [StaticPiece " {} { {{ }{ {{{}} cool } ",BlockPiece "inner_body" [StaticPiece " Hello: ",VarPiece "hello"],StaticPiece " "]]
    8
    Right [BlockPiece "body" [StaticPiece " {} ",VarPiece "cool",StaticPiece " ",BlockPiece "inner_body" [StaticPiece " Hello: ",VarPiece "hello"]],StaticPiece " ldsakjf"]
    ">>"
    Right (BlockPiece "body" [])