Search code examples
parsinghaskellparsec

Parsec `try` should backtrack


Isn't Parsec's try supposed to backtrack when it encounters failure?

For instance, if I have the code

import Control.Applicative ((<|>))
import Debug.Trace
import Text.Parsec (try)
import Text.Parsec.Combinator (eof)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec

data Func = Add | Sub deriving (Show)

add :: Parser Func
add = do
    Parsec.string "+"
    Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
    return Add

sub :: Parser Func
sub = do
    Parsec.string "-"
    Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
    return Sub

func :: Parser Func
func = add <|> sub

data AST = 
      Primitive String
    | BinOp Func AST AST
  deriving (Show)

primitive :: Parser AST
primitive = do
    str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
    trace "primitive returning" $ return $ Primitive str

binOp :: Parser AST
binOp = do
    lhs <- parser
    Parsec.spaces
    operation <- func
    Parsec.spaces
    rhs <- parser
    return $ BinOp operation lhs rhs

parser :: Parser AST
parser = try primitive <|> binOp

sample :: String
sample = "+#-! + -$!&+"

main :: IO ()
main = print $ Parsec.parse (parser <* eof) "failure" sample

I get the output:

primitive returning
Left "failure" (line 1, column 5):
unexpected ' '
expecting end of input

However, from my understanding, when the parser encounters try primitive <|> binOp and primitive fails, it should go to the binOp option.

Desired parse is

BinOp Add (Primitive "+#-!") (Primitive "-$!&+")

Solution

  • The documentation for try doesn't actually use the word "backtrack". What it says is:

    The parser try p behaves like parser p, except that it pretends that it hasn't consumed any input when an error occurs.

    This may make little sense until you match it up with the documentation for the (<|>) operator which states:

    The parser p <|> q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried.

    Note the emphasized phrase -- that's emphasis in the original documentation, not emphasis I've added.

    Here's how this leads to "backtracking". Consider a simplified version of your parser that allows no spaces, only single-letter primitives, and addition of two primitives as the only operation:

    ...
    import Text.Parsec as Parsec
    import Text.Parser.Char as Parsec
    ...
    
    add :: Parser Func
    add = do
      Parsec.string "+"
      return Add
    
    primitive :: Parser AST
    primitive = do
      c <- Parsec.letter
      return $ Primitive (c:"")
    
    binOp :: Parser AST
    binOp = do
      lhs <- primitive
      operation <- add
      rhs <- primitive
      return $ BinOp operation lhs rhs
    
    parser :: Parser AST
    parser = binOp <|> primitive
    

    This parser doesn't work. It parses "x+y" okay:

    λ> parseTest parser "x+y"
    BinOp Add (Primitive "x") (Primitive "y")
    

    but it fails to parse "x" alone:

    λ> parseTest parser "x"
    parse error at (line 1, column 2):
    unexpected end of input
    expecting "+"
    

    The problem in this second case is that the <|> operator tries binOp first. The binOp parser consumes a primitive, namely "x", and then fails to parse add (because there's no "+" character). As a result, binOp fails after consuming input. The <|> doesn't try its right-hand side alternative parser (primitive) because p <|> q only tries q if p fails without consuming input, and that's not what happened here. In other words, by default, after the left-hand side of the alternative consumes some input and then fails, there is no "backtracking" to put that consumed input back.

    The try combinator changes this behavior by "undoing" the consumption of input in case of failure. If we fix the parser:

    parser :: Parser AST
    parser = try binOp <|> primitive
    

    then it works fine on the input "x":

    λ> parseTest parser "x"
    Primitive "x"
    

    As before, the <|> operator tries the try binOp parser first. The try binOp parser acts just like binOp -- it consumes a primitive and then fails to parse add, resulting in a "failure after consuming input". But try "backtracks" by replacing this with a "failure without consuming any input" (essentially, putting the input back). This satisfies the requirement for <|> to try the alternative parser primitive which, of course, successfully parses "x", and that's the final result of the parse.

    How is this different from your situation? Well, you've got the equivalent of:

    parser :: Parser AST
    parser = try primitive <|> binOp
    

    This parser is broken, too. It works on "x":

    λ> parseTest parser "x"
    Primitive "x"
    

    and it "works" on "x+y" but returns the wrong answer:

    λ> parseTest parser "x+y"
    Primitive "x"
    

    The problem is that the <|> first tries the try primitive parser, which acts just like primitive. Even in the second test case, it succeeds after consuming the primitive "x" (leaving "+y" yet to be parsed). Note that try has no effect, because the only thing try does is take "failure after consuming input" and replace it with "failure without consuming any input", but primitive doesn't fail! It succeeds! So, there's no backtracking because there's no failure to backtrack from. The result of parser is Primitive "x", and now there's some unparseable garbage "+y" left that will trigger a failure in some other parser that follows parser, like eof.

    So, to start to fix your parser, you could replace the definition of parser with:

    parser :: Parser AST
    parser = try binOp <|> primitive
    

    Here, the try binOp will start to parse a binOp, and if it fails part way through (e.g., after parsing a primitive but not finding an operator), then try will trigger backtracking so the primitive parser can be tried instead.

    Unfortunately, this will cause an infinite loop in your parser because parser starts by trying binOp which starts by invoking parser (for the left-hand side) which starts by trying binOp, etc., etc., and neither parser makes any progress.

    There are a few solutions to this problem. There is machinery in Text.Parsec.Expr for handling expressions in general, and there are functions like sepBy or chainl1 for correctly parsing a sequence of terms separated by operators (without creating an infinite loop).

    If you want to do it manually, you'll end up having to do something like:

    parser :: Parser AST
    parser = do
      -- there's always a primitive
      lhs <- primitive
      Parsec.spaces
      -- it may be followed by some binOp/primitives
      followedByBinOp lhs
    
    followedByBinOp :: AST -> Parser AST
    followedByBinOp lhs =
      -- if there's another binop/primitive
      (do operator <- func
          Parsec.spaces
          rhs <- primitive
          Parsec.spaces
          -- then loop with a new LHS
          followedByBinOp (BinOp operator lhs rhs)
      -- otherwise, return what we've got
      <|> return lhs)
    

    This is super ugly, which is why chainl1 was invented. The above can be replaced with:

    parser :: Parser AST
    parser = Parsec.chainl1 (primitive <* Parsec.spaces) (BinOp <$> func <* Parsec.spaces)
    

    With either of these versions of parser, your parser still won't work because of two more problems. First, your primitive doesn't know when to stop:

    primitive :: Parser AST
    primitive = do
        str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
        trace "primitive returning" $ return $ Primitive str
    

    Here, the only way to end a primitive is with a space. That means that the expression "x+y+z" is a valid single primitive. If you try to parse this with parser, the initial lhs <- primitive will eat the whole string, and you'll get back Primitive "x+y+z". If you don't mind spaces being mandatory, then you'll be able to get it to parse "x + y + z", but if you'd like spaces to be optional, you'll need to somehow restrict the allowed primitives. (There's a reason most languages doesn't allow completely arbitrary identifiers. If x+y or -- worse -- +#-! -- is a valid identifier, it's hard to parse expressions with infix operators.)

    If you allow only alphanumeric strings as primitives, that'll get you a little closer:

    primitive :: Parser AST
    primitive = do
        str <- Parsec.many1 $ Parsec.satisfy $ Char.isAlphaNum
        return $ Primitive str
    

    However, your parser still won't work because add and sub are misusing notFollowedBy. Consider add:

    add :: Parser Func
    add = do
        Parsec.string "+"
        Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
        return Add
    

    This works by first parsing a "+" character. Then, it tries to parse another character. If that character satisfies the predicate (which it does because it's always True), then the whole parse will fail after consuming input. That's because notFollowedBy is designed to fail if its parser succeeds. So, except in the case where a "+" occurs at the end of the string, this add parser always fails after consuming input, aborting the entire parse.

    For now, at least, it's probably best to just delete those notFollowedBy lines.

    Anyway, here's a modified version of your parser that more or less works, but only accepts alphanumeric primitives. It actually doesn't need to backtrack, so it doesn't use try anywhere:

    import Control.Applicative ((<|>))
    import Text.Parsec.Combinator (eof)
    import Text.Parsec.String (Parser)
    import qualified Data.Char as Char
    import qualified Text.Parsec as Parsec
    
    data Func = Add | Sub deriving (Show)
    
    add :: Parser Func
    add = do
        Parsec.string "+"
        -- Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
        return Add
    
    sub :: Parser Func
    sub = do
        Parsec.string "-"
        -- Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
        return Sub
    
    func :: Parser Func
    func = add <|> sub
    
    data AST =
          Primitive String
        | BinOp Func AST AST
      deriving (Show)
    
    primitive :: Parser AST
    primitive = do
        str <- Parsec.many1 $ Parsec.satisfy $ Char.isAlphaNum
        return $ Primitive str
    
    parser :: Parser AST
    parser = do
      -- there's always a primitive
      lhs <- primitive
      Parsec.spaces
      -- it may be followed by some binOp/primitives
      followedByBinOp lhs
    
    followedByBinOp :: AST -> Parser AST
    followedByBinOp lhs =
      -- if there's another binop/primitive
      (do operator <- func
          Parsec.spaces
          rhs <- primitive
          Parsec.spaces
          -- then loop with a new LHS
          followedByBinOp (BinOp operator lhs rhs)
      -- otherwise, return what we've got
      <|> return lhs)
    
    sample :: String
    sample = "xyz + 15 - 123abc"
    
    main :: IO ()
    main = print $ Parsec.parse (parser <* eof) "failure" sample