Search code examples
parsinghaskellgrammarcontext-free-grammarparsec

Using Parsec to parse regular expressions


I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:

EXP  : EXP *
     | LIT EXP
     | LIT

I've tried to implement this in Haskell as:

expr = try star
       <|> try litE
       <|> lit

litE  = do c <- noneOf "*"
           rest <- expr
           return (c : rest)

lit   = do c <- noneOf "*"
           return [c]

star = do content <- expr
          char '*'
          return (content ++ "*")

There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.

Any thoughts?


Solution

  • You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!

    You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.

    Here's my attempt (I threw in |, +, and ? for good measure):

    import Control.Applicative
    import Control.Monad
    import Text.ParserCombinators.Parsec
    import Text.ParserCombinators.Parsec.Expr
    
    data Term = Literal Char
              | Sequence [Term]
              | Repeat (Int, Maybe Int) Term
              | Choice [Term]
      deriving ( Show )
    
    term :: Parser Term
    term = buildExpressionParser ops atom where
    
      ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
              , Postfix (Repeat (1, Nothing) <$ char '+')
              , Postfix (Repeat (0, Just 1)  <$ char '?')
              ]
            , [ Infix (return sequence) AssocRight
              ]
            , [ Infix (choice <$ char '|') AssocRight
              ]
            ]
    
      atom = msum [ Literal <$> lit
                  , parens term
                  ]
    
      lit = noneOf "*+?|()"
      sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
      choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
      parens = between (char '(') (char ')')
    
      seqTerms (Sequence ts) = ts
      seqTerms t = [t]
    
      choiceTerms (Choice ts) = ts
      choiceTerms t = [t]
    
    main = parseTest term "he(llo)*|wor+ld?"