parsing haskell megaparsec

How to report multiple errors using megaparsec?


Per the Megaparsec docs, "Since version 8, reporting multiple parse errors at once has become much easier." I haven't been able to find a single example of doing it. The only one I've found is this. However, it only shows how to parse a newline-delimited toy language, and it does not show how to combine multiple errors into a ParseErrorBundle. This SO discussion is not conclusive.


Solution

  • You want to use withRecovery to recover from Megaparsec-generated errors in conjunction with registerParseError (or registerFailure or registerFancyFailure) to "register" those errors (or your own generated errors) for delayed processing.

    At the end of the parse, if no parse errors have been registered, then parsing succeeds, while if one or more parse errors have been registered, they are all printed at once. If you register parse errors and then also trigger an unrecovered error, parsing immediately terminates and the registered errors and the final unrecovered error will all be printed.
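
    For the "your own generated errors" case, a minimal sketch might look like the following. The byte parser, its 0..255 range check, and the error message are hypothetical and not part of the original example; the sketch also assumes the containers package for Data.Set. It records the offset before the number so that the delayed error points at the start of the offending value, registers a FancyError, and keeps parsing:

    ```haskell
    import qualified Data.Set as Set
    import Data.Void (Void)
    import Text.Megaparsec
    import Text.Megaparsec.Char

    type Parser = Parsec Void String

    -- Hypothetical rule for illustration: each value must fit in 0..255.
    byte :: Parser Int
    byte = do
      o <- getOffset                      -- offset of the first digit
      n <- read <$> some digitChar
      if n > 255
        then do
          -- Register the error for delayed reporting; parsing continues.
          registerParseError
            (FancyError o (Set.singleton (ErrorFail "byte out of range")))
          pure 0                          -- placeholder value
        else pure n

    bytes :: Parser [Int]
    bytes = sepBy byte (char ',') <* eof
    ```

    With this sketch, an input such as "1,300,2,999" should be rejected with two registered errors, one per out-of-range value, even though the parser itself never fails.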

    Here's a very simple example that parses a comma-separated list of numbers:

    import Data.Void
    import Text.Megaparsec
    import Text.Megaparsec.Char
    
    type Parser = Parsec Void String
    
    numbers :: Parser [Int]
    numbers = sepBy number comma <* eof
      where number = read <$> some digitChar
            comma  = recover $ char ','
            -- recover to next comma
            recover = withRecovery $ \e -> do
              registerParseError e
              some (anySingleBut ',')
              char ','
    

    On good input:

    > parseTest numbers "1,2,3,4,5"
    [1,2,3,4,5]
    

    and on input with multiple errors:

    > parseTest numbers "1.2,3e5,4,5x"
    1:2:
      |
    1 | 1.2,3e5,4,5x
      |  ^
    unexpected '.'
    expecting ','
    
    1:6:
      |
    1 | 1.2,3e5,4,5x
      |      ^
    unexpected 'e'
    expecting ','
    
    1:12:
      |
    1 | 1.2,3e5,4,5x
      |            ^
    unexpected 'x'
    expecting ',', digit, or end of input
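
    If you want the errors as data rather than printed output, no manual combining is needed: runParser already returns all registered errors, plus the final unrecovered one if there is one, in a single ParseErrorBundle. A minimal sketch, reusing the numbers parser from above (errorOffsets and report are hypothetical helper names):

    ```haskell
    import qualified Data.List.NonEmpty as NE
    import Data.Void (Void)
    import Text.Megaparsec
    import Text.Megaparsec.Char

    type Parser = Parsec Void String

    -- Same parser as above, repeated so the block is self-contained.
    numbers :: Parser [Int]
    numbers = sepBy number comma <* eof
      where
        number = read <$> some digitChar
        comma = recover $ char ','
        recover = withRecovery $ \e -> do
          registerParseError e
          _ <- some (anySingleBut ',')
          char ','

    -- Offsets (0-based) of every error collected in the bundle.
    errorOffsets :: String -> [Int]
    errorOffsets input =
      case runParser numbers "" input of
        Left bundle -> map errorOffset (NE.toList (bundleErrors bundle))
        Right _ -> []

    -- Render the whole bundle as one String instead of printing it.
    report :: String -> Either String [Int]
    report input =
      case runParser numbers "" input of
        Left bundle -> Left (errorBundlePretty bundle)
        Right ns -> Right ns
    ```

    For the input "1.2,3e5,4,5x" shown above, errorOffsets should yield three offsets, matching the three reported errors.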
    

    There are some subtleties here. For the following, only the first parse error is handled:

    > parseTest numbers "1,2,e,4,5x"
    1:5:
      |
    1 | 1,2,e,4,5x
      |     ^
    unexpected 'e'
    expecting digit
    

    and you have to study the parser carefully to see why. The sepBy successfully applies the number and comma parsers in alternating sequence to parse "1,2,". When it gets to e, it applies the number parser, which fails (because some digitChar requires at least one digit). Only comma is wrapped in recover, so this failure is an unrecovered error: parsing ends immediately, and since no other errors were registered, only the one error is printed.
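
    One way to also report the later errors here is to recover at the level of each field rather than at the separator. The following is a sketch under assumptions, not the approach of the example above: the names fields, field, endOfField, and skipField are hypothetical, and a field that failed to parse is represented as Nothing. Unlike numbers, this version also registers an error for empty input, because it always expects at least one field.

    ```haskell
    import Data.Void (Void)
    import Text.Megaparsec
    import Text.Megaparsec.Char

    type Parser = Parsec Void String

    fields :: Parser [Maybe Int]
    fields = sepBy field (char ',') <* eof
      where
        -- A field is digits followed (without consuming) by ',' or end of input.
        field :: Parser (Maybe Int)
        field = withRecovery skipField (Just . read <$> some digitChar <* endOfField)

        endOfField :: Parser ()
        endOfField = lookAhead ((() <$ char ',') <|> eof)

        -- On failure: register the error, skip the rest of the field
        -- (up to, but not including, the next comma), and carry on.
        skipField :: ParseError String Void -> Parser (Maybe Int)
        skipField e = do
          registerParseError e
          _ <- takeWhileP Nothing (/= ',')
          pure Nothing
    ```

    On "1,2,e,4,5x" this sketch should register two errors, one at e and one at x, instead of stopping at the first.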

    Also, if you dropped the <* eof from the definition of numbers (e.g., to make it part of a larger parser), you'd discover that:

    > parseTest numbers "1,2,3.4,5"
    

    gives a parse error on the period, but:

    > parseTest numbers "1,2,3.4"
    

    parses fine. On the other hand:

    > parseTest numbers "1,2,3.4\n hundreds of lines without commas\nfinal line, with comma"
    

    gives parse errors on the period and just after the comma on the final line, where a number is expected.

    The issue is that the comma parser is used by sepBy to determine when the comma-separated list of numbers has ended. If the comma parser succeeds (which it can do via recovery, gobbling up hundreds of lines to reach the next comma), sepBy keeps going; if it fails (both initially and in recovery, because the recovery code can't find a comma even after scanning the rest of the file), sepBy completes.
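
    One way to limit how far recovery can run is to stop skipping at the end of the current line. A minimal sketch, assuming a list never spans multiple lines (numbersBounded and skipToComma are hypothetical names). If no comma is found before the newline, recovery fails, the comma parser fails without consuming input, and sepBy simply ends the list there:

    ```haskell
    import Data.Void (Void)
    import Text.Megaparsec
    import Text.Megaparsec.Char

    type Parser = Parsec Void String

    -- No trailing eof: meant to be embedded in a larger parser.
    numbersBounded :: Parser [Int]
    numbersBounded = sepBy number comma
      where
        number :: Parser Int
        number = read <$> some digitChar

        comma :: Parser Char
        comma = withRecovery skipToComma (char ',')

        skipToComma :: ParseError String Void -> Parser Char
        skipToComma e = do
          registerParseError e
          -- Skip junk, but never past the end of the current line.
          _ <- takeWhile1P Nothing (\c -> c /= ',' && c /= '\n')
          char ','
    ```

    With this variant, the multi-line input above should yield [1,2,3] and leave the rest of the input for whatever parser runs next, while "1,2,3.4,5" is still reported as an error at the period.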

    Ultimately, writing recoverable parsers is kind of tricky.