Search code examples
parsinghaskellparsecmegaparsechaskell-parsing

Using makeExprParser with ambiguity


I'm currently encountering a problem while translating a parser from a CFG-based tool (antlr) to Megaparsec.

The grammar contains lists of expressions (handled with makeExprParser) that are enclosed in brackets (<, >) and separated by ,.

Stuff like <>, <23>, <23,87> etc.

The problem now is that the expressions may themselves contain the > operator (meaning "greater than"), which causes my parser to fail. <1223>234> should, for example, be parsed into [BinaryExpression ">" (IntExpr 1223) (IntExpr 234)].

I presume that I have to strategically place try somewhere, but the places I tried (to the first argument of sepBy and the first argument of makeExprParser) did unfortunately not work.

Can I use makeExprParser in such a situation or do I have to manually write the expression parser?:

This is the relevant part of my parser:

-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type BinaryOperator = Text

type Name = Text

data Expr
  = IntExpr Integer
  | BinaryExpression BinaryOperator Expr Expr
  deriving (Eq, Show)

type Parser = Parsec Void Text

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")

parseInteger :: Parser Expr
parseInteger = do
  number <- some digitChar
  _ <- sc
  return $ IntExpr $ read number

parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">" <$ symbol ">")]]

parseBracketList :: Parser [Expr]
parseBracketList = do
  _ <- symbol "<"
  exprs <- sepBy parseExpr (symbol ",")
  _ <- symbol ">"
  return exprs

main :: IO ()
main = do
  text : _ <- getArgs
  let res = runParser parseBracketList "stdin" (pack text)
  case res of
    (Right suc) -> do
      print suc
    (Left err) ->
      putStrLn $ errorBundlePretty err

Solution

  • You've (probably) misdiagnosed the problem. Your parser fails on <1233>234> because it's trying to parse > as a left associative operator, like +. In other words, the same way:

    1+2+
    

    would fail, because the second + has no right-hand operand, your parser is failing because:

    1233>234>
    

    has no digit following the second >. Assuming you don't want your > operator to chain (i.e., 1>2>3 is not a valid Expr), you should first replace InfixL with InfixN (non-associative) in your makeExprParser table. Then, it will parse this example fine.

    Unfortunately, with or without this change your parser will still fail on the simpler test case:

    <1233>
    

    because the > is interpreted as an operator within a continuing expression.

    In other words, the problem isn't that your parser can't handle expressions with > characters, it's that it's overly aggressive in treating > characters as part of an expression, preventing them from being recognized as the closing angle bracket.

    To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where > can be part of a continuing expression and where it can't.

    One rule that will probably work is to only consider a > as an operator if it is followed by a valid "term" (i.e., a parseInteger). You can do this with lookAhead. The parser:

    symbol ">" <* lookAhead term
    

    will parse a > operator only if it is followed by a valid term. If it fails to find a term, it will consume some input (at least the > symbol itself), so you must surround it with a try:

    try (symbol ">" <* lookAhead term)
    

    With the above two fixes applied to parseExpr:

    parseExpr :: Parser Expr
    parseExpr = makeExprParser term
      [[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
      where term = parseInteger
    

    you'll get the following parses:

    λ> parseTest parseBracketList "<23>"
    [IntExpr 23]
    λ> parseTest parseBracketList "<23,87>"
    [IntExpr 23,IntExpr 87]
    λ> parseTest parseBracketList "<23,87>18>"
    [IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]
    

    However, the following will fail:

    λ> parseTest parseBracketList "<23,87>18"
    1:10:
      |
    1 | <23,87>18
      |          ^
    unexpected end of input
    expecting ',', '>', or digit
    λ> 
    

    because the fact that the > is followed by 18 means that it is a valid operator, and it is parse failure that the valid expression 87>18 is followed by neither a comma nor a closing > angle bracket.

    If you need to parse something like <23,87>18, you have bigger problems. Consider the following two test cases:

    <1,2>3,4,5,6,7,...,100000000000,100000000001>
    <1,2>3,4,5,6,7,...,100000000000,100000000001
    

    It's a challenge to write an efficient parser that will parse the first one as a list of 10000000000 expressions but the second one as a list of two expression:

    [IntExpr 1, IntExpr 2]
    

    followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.