Search code examples
parsinghaskelloperator-precedenceparsecassociativity

Right way to parse chain of various binary functions with `Parsec`?


It is true that Parsec has chainl and chainr to parse chains of either left-associative or right-associative operations (i.e. a -> a -> a). So I could quite easily parse something like x + y + z in a ((a + y) + z) or (a + (y + z)) manner.

However,

  1. there is no standard way to parse a -> b -> c functions and specific case when a = b: a -> a -> c, for example a = b = c thought as a comparison function (a -> a -> Bool);
  2. there is no standard way to implement "importance" of an operation: for example a + b = b + a should be parsed as ((a + b) = (b + a)) and not (((a + b) = b) + a)).

I am kind of new to parsing problems, so it would be great to get answers for both questions.


Solution

  • Okay, here's a long answer that might help. First, these are the imports I'm using, if you want to follow along:

    {-# LANGUAGE FlexibleContexts #-}
    {-# OPTIONS_GHC -Wall #-}
    import Control.Applicative (some)
    import Text.Parsec
    import Text.Parsec.Expr
    import Text.Parsec.String
    

    Why a -> a -> a isn't so bad...

    The operator type signature a -> a -> a is less restrictive and makes more sense than you might at first think. One key point is that usually when we're parsing expressions, we don't write a parser to evaluate them directly but rather parse them into some intermediate abstract syntax tree (AST) that is later evaluated. For example, consider a simple untyped AST with addition, subtraction, equality, and boolean connectives:

    data Expr
      = IntE Int        -- integer literals
      | FalseE | TrueE  -- boolean literals (F, T)
      | AddE Expr Expr  -- x + y
      | SubE Expr Expr  -- x - y
      | EqE  Expr Expr  -- x = y
      | OrE  Expr Expr  -- x | y
      | AndE Expr Expr  -- x & y
      deriving (Show)
    

    If we want to write a parser to treat all these operators as left associative at the same precedence level, we can write a chainl-based parser like so. (For simplicity, this parser doesn't permit whitespace.)

    expr :: Parser Expr
    expr = chainl1 term op
      where op = AddE <$ char '+'
             <|> SubE <$ char '-'
             <|> EqE  <$ char '='
             <|> OrE  <$ char '|'
             <|> AndE <$ char '&'
    term :: Parser Expr
    term = IntE . read <$> some digit
       <|> FalseE <$ char 'F' <|> TrueE <$ char 'T'
       <|> parens expr
    parens :: Parser a -> Parser a
    parens = between (char '(') (char ')')
    

    and we get:

    > parseTest expr "1+2+3"
    AddE (AddE (IntE 1) (IntE 2)) (IntE 3)
    > parseTest expr "1=2=F"
    EqE (EqE (IntE 1) (IntE 2)) FalseE
    >
    

    We'd then leave it up to the interpreter to deal with the types (i.e., to type check the program):

    data Value = BoolV Bool | IntV Int deriving (Eq, Show)
    eval :: Expr -> Value
    eval (IntE x) = IntV x
    eval FalseE = BoolV False
    eval TrueE = BoolV True
    eval (AddE e1 e2)
      = let IntV v1 = eval e1  -- pattern match ensures right type
            IntV v2 = eval e2
        in  IntV (v1 + v2)
    eval (SubE e1 e2)
      = let IntV v1 = eval e1
            IntV v2 = eval e2
        in  IntV (v1 - v2)
    eval (EqE e1 e2) = BoolV (eval e1 == eval e2)  -- equal if same type and value
    eval (OrE e1 e2)
      = let BoolV v1 = eval e1
            BoolV v2 = eval e2
        in  BoolV (v1 || v2)
    eval (AndE e1 e2)
      = let BoolV v1 = eval e1
            BoolV v2 = eval e2
        in  BoolV (v1 && v2)
    
    evalExpr :: String -> Value
    evalExpr str = let Right e = parse expr "<evalExpr>" str in eval e
    

    giving:

    > evalExpr "1+2+3"
    IntV 6
    > evalExpr "1=2=F"
    BoolV True
    >
    

    Note that even though the type of the "=" operator is something like Eq a => a -> a -> Bool (or actually a -> b -> Bool, as we allow comparison of unequal types), it's represented in the AST as the constructor EqE of type Expr -> Expr -> Expr, so the a -> a -> a type makes sense.

    Even if we were to combine the parser and evaluator above into a single function, we'd probably find it easiest to use a dynamic Value type, so all operators would be of type Value -> Value -> Value which fits into the a -> a -> a pattern:

    expr' :: Parser Value
    expr' = chainl1 term' op
      where op = add <$ char '+'
             <|> sub <$ char '-'
             <|> eq  <$ char '='
             <|> or  <$ char '|'
             <|> and <$ char '&'
            add (IntV x) (IntV y) = IntV $ x + y
            sub (IntV x) (IntV y) = IntV $ x - y
            eq  v1 v2 = BoolV $ v1 == v2
            or  (BoolV x) (BoolV y) = BoolV $ x || y
            and (BoolV x) (BoolV y) = BoolV $ x && y
    term' :: Parser Value
    term' = IntV . read <$> some digit
       <|> BoolV False <$ char 'F' <|> BoolV True <$ char 'T'
       <|> parens expr'
    

    This works too, with the parser directly evaluating the expression

    > parseTest expr' "1+2+3"
    IntV 6
    > parseTest expr' "1=2=F"
    BoolV True
    >
    

    You may find this use of dynamic typing during parsing and evaluation a little unsatifactory, but see below.

    Operator Precedence

    The standard way of adding operator precedence is to define multiple expression "levels" that work with a subset of the operators. If we want a precedence ordering from highest to lowest of addition/subtraction, then equality, then boolean "and", then boolean "or", we could replace expr' with the following. Note that each chainl1 call uses as "terms" the next (higher-precedence) expression level:

    expr0 :: Parser Value
    expr0 = chainl1 expr1 op
      where op = or  <$ char '|'
            or  (BoolV x) (BoolV y) = BoolV $ x || y
    expr1 :: Parser Value
    expr1 = chainl1 expr2 op
      where op = and <$ char '&'
            and (BoolV x) (BoolV y) = BoolV $ x && y
    expr2 :: Parser Value
    expr2 = chainl1 expr3 op
      where op = eq  <$ char '='
            eq  v1 v2 = BoolV $ v1 == v2
    expr3 :: Parser Value
    expr3 = chainl1 term'' op
      where op = add <$ char '+'  -- two operators at same precedence
             <|> sub <$ char '-'
            add (IntV x) (IntV y) = IntV $ x + y
            sub (IntV x) (IntV y) = IntV $ x - y
    term'' :: Parser Value
    term'' = IntV . read <$> some digit
         <|> BoolV False <$ char 'F' <|> BoolV True <$ char 'T'
         <|> parens expr0
    

    After which:

    > parseTest expr0 "(1+5-6=2-3+1&2+2=4)=(T|F)"
    BoolV True
    >
    

    As this can be tedious, Parsec provides a Text.Parsec.Expr that makes this easier. The following replaces expr0 through expr3 above:

    expr0' :: Parser Value
    expr0' = buildExpressionParser table term''
      where table = [ [binary '+' add, binary '-' sub]
                    , [binary '=' eq]
                    , [binary '&' and]
                    , [binary '|' or]
                    ]
            binary c op = Infix (op <$ char c) AssocLeft
            add (IntV x) (IntV y) = IntV $ x + y
            sub (IntV x) (IntV y) = IntV $ x - y
            eq  v1 v2 = BoolV $ v1 == v2
            and (BoolV x) (BoolV y) = BoolV $ x && y
            or  (BoolV x) (BoolV y) = BoolV $ x || y
    

    Typed Parsing

    You may find it strange above that we use an untyped AST (i.e., everything's an Expr) and dynamically typed Value instead of using Haskell's type system in the parsing. It is possible to design a parser where the operators actually have expected Haskell types. In the language above, equality causes a bit of an issue, but if we permit integer equality only, it's possible to write a typed parser/evaluator as follows. Here bexpr and iexpr are for boolean-valued and integer-values expressions respectively.

    bexpr0 :: Parser Bool
    bexpr0 = chainl1 bexpr1 op
      where op = (||) <$ char '|'
    bexpr1 :: Parser Bool
    bexpr1 = chainl1 bexpr2 op
      where op = (&&) <$ char '&'
    bexpr2 :: Parser Bool
    bexpr2 = False <$ char 'F' <|> True <$ char 'T'
         <|> try eqexpr
         <|> parens bexpr0
         where eqexpr = (==) <$> iexpr3 <* char '=' <*> iexpr3  -- this can't chain now
    iexpr3 :: Parser Int
    iexpr3 = chainl1 iterm op
      where op = (+) <$ char '+'
             <|> (-) <$ char '-'
    iterm :: Parser Int
    iterm = read <$> some digit
         <|> parens iexpr3
    

    Note that we're still able to use chainl1, but there's a boundary between the integer and boolean types enforced by precedence, so we only ever chain Int -> Int -> Int or Bool -> Bool -> Bool operators, and we don't let the Int -> Int -> Bool integer equality operator chain.

    This also means we need to use a different parser to parse a boolean versus an integer expression:

    > parseTest bexpr0 "1+2=3"
    True
    > parseTest iexpr3 "1+2-3"  -- iexpr3 is top-most integer expression parser
    0
    >
    

    Note here that if you wanted integer equality to chain as a set of equalities so that 1+1=2=3-1 would check that all three terms are equal, you could do this with chainl1 using some trickery with lists and singleton values, but it's easier to use sepBy1 and replace eqexpr above with the definition:

    eqexpr' = do
      x:xs <- sepBy1 iexpr3 (char '=')
      return $ all (==x) xs
    

    giving:

    > parseTest bexpr0 "1+1=2=3-1"
    True
    
    

    The whole program

    To summarize, here's all the code:

    {-# LANGUAGE FlexibleContexts #-}
    {-# OPTIONS_GHC -Wall #-}
    
    import Control.Applicative (some)
    import Text.Parsec
    import Text.Parsec.Expr
    import Text.Parsec.String
    
    -- * Untyped parser to AST
    
    data Expr
      = IntE Int        -- integer literals
      | FalseE | TrueE  -- boolean literals (F, T)
      | AddE Expr Expr  -- x + y
      | SubE Expr Expr  -- x - y
      | EqE  Expr Expr  -- x = y
      | OrE  Expr Expr  -- x | y
      | AndE Expr Expr  -- x & y
      deriving (Show)
    
    expr :: Parser Expr
    expr = chainl1 term op
      where op = AddE <$ char '+'
             <|> SubE <$ char '-'
             <|> EqE  <$ char '='
             <|> OrE  <$ char '|'
             <|> AndE <$ char '&'
    
    term :: Parser Expr
    term = IntE . read <$> some digit
       <|> FalseE <$ char 'F' <|> TrueE <$ char 'T'
       <|> parens expr
    
    parens :: Parser a -> Parser a
    parens = between (char '(') (char ')')
    
    -- * Interpreter
    
    data Value = BoolV Bool | IntV Int deriving (Eq, Show)
    
    eval :: Expr -> Value
    eval (IntE x) = IntV x
    eval FalseE = BoolV False
    eval TrueE = BoolV True
    eval (AddE e1 e2)
      = let IntV v1 = eval e1  -- pattern match ensures right type
            IntV v2 = eval e2
        in  IntV (v1 + v2)
    eval (SubE e1 e2)
      = let IntV v1 = eval e1
            IntV v2 = eval e2
        in  IntV (v1 - v2)
    eval (EqE e1 e2) = BoolV (eval e1 == eval e2)  -- equal if same type and value
    eval (OrE e1 e2)
      = let BoolV v1 = eval e1
            BoolV v2 = eval e2
        in  BoolV (v1 || v2)
    eval (AndE e1 e2)
      = let BoolV v1 = eval e1
            BoolV v2 = eval e2
        in  BoolV (v1 && v2)
    
    -- * Combined parser/interpreter with no intermediate AST
    
    expr' :: Parser Value
    expr' = chainl1 term' op
      where op = add <$ char '+'
             <|> sub <$ char '-'
             <|> eq  <$ char '='
             <|> or  <$ char '|'
             <|> and <$ char '&'
            add (IntV x) (IntV y) = IntV $ x + y
            sub (IntV x) (IntV y) = IntV $ x - y
            eq  v1 v2 = BoolV $ v1 == v2
            or  (BoolV x) (BoolV y) = BoolV $ x || y
            and (BoolV x) (BoolV y) = BoolV $ x && y
    term' :: Parser Value
    term' = IntV . read <$> some digit
       <|> BoolV False <$ char 'F' <|> BoolV True <$ char 'T'
       <|> parens expr'
    
    -- * Parser/interpreter with operator precendence
    
    expr0 :: Parser Value
    expr0 = chainl1 expr1 op
      where op = or  <$ char '|'
            or  (BoolV x) (BoolV y) = BoolV $ x || y
    expr1 :: Parser Value
    expr1 = chainl1 expr2 op
      where op = and <$ char '&'
            and (BoolV x) (BoolV y) = BoolV $ x && y
    expr2 :: Parser Value
    expr2 = chainl1 expr3 op
      where op = eq  <$ char '='
            eq  v1 v2 = BoolV $ v1 == v2
    expr3 :: Parser Value
    expr3 = chainl1 term'' op
      where op = add <$ char '+'  -- two operators at same precedence
             <|> sub <$ char '-'
            add (IntV x) (IntV y) = IntV $ x + y
            sub (IntV x) (IntV y) = IntV $ x - y
    term'' :: Parser Value
    term'' = IntV . read <$> some digit
         <|> BoolV False <$ char 'F' <|> BoolV True <$ char 'T'
         <|> parens expr0
    
    -- * Alternate implementation using buildExpressionParser
    
    expr0' :: Parser Value
    expr0' = buildExpressionParser table term''
      where table = [ [binary '+' add, binary '-' sub]
                    , [binary '=' eq]
                    , [binary '&' and]
                    , [binary '|' or]
                    ]
            binary c op = Infix (op <$ char c) AssocLeft
            add (IntV x) (IntV y) = IntV $ x + y
            sub (IntV x) (IntV y) = IntV $ x - y
            eq  v1 v2 = BoolV $ v1 == v2
            and (BoolV x) (BoolV y) = BoolV $ x && y
            or  (BoolV x) (BoolV y) = BoolV $ x || y
    
    -- * Typed parser/interpreter with separate boolean and integer expressions
    
    bexpr0 :: Parser Bool
    bexpr0 = chainl1 bexpr1 op
      where op = (||) <$ char '|'
    bexpr1 :: Parser Bool
    bexpr1 = chainl1 bexpr2 op
      where op = (&&) <$ char '&'
    bexpr2 :: Parser Bool
    bexpr2 = False <$ char 'F' <|> True <$ char 'T'
         <|> try eqexpr
         <|> parens bexpr0
         where eqexpr = (==) <$> iexpr3 <* char '=' <*> iexpr3  -- this can't chain now
    iexpr3 :: Parser Int
    iexpr3 = chainl1 iterm op
      where op = (+) <$ char '+'
             <|> (-) <$ char '-'
    iterm :: Parser Int
    iterm = read <$> some digit
         <|> parens iexpr3
    
    -- * Alternate definition of eqexpr to allow 4=2+2=1+3
    
    eqexpr' = do
      x:xs <- sepBy1 iexpr3 (char '=')
      return $ all (==x) xs