Search code examples
parsinghaskellparser-combinatorsmegaparsec

Parsing inside `between` with Megaparsec


I am writing a parser for a markdown-like document format. I want to be able to match something like ^[some *formatted* text] as a footnote in my syntax definition. Here's a minimal example:

{- cabal:
build-depends: base, text, megaparsec, parser-combinators, hspec, hspec-megaparsec
-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Test.Hspec
import Test.Hspec.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer qualified as L

type Parser = Parsec Void Text

data Words
  = PlainText Text
  | BoldText Text
  | MagicText [Words]
  deriving (Show, Eq)

text_ :: Parser Words
text_ =
  choice
    [
      MagicText <$> between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']')),
      BoldText <$> between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*')),
      PlainText <$> takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')
    ]

main :: IO ()
main = hspec $ do
  context "for basic one-word-at-a-time input" $ do
    it "parses plain text" $ parse text_ "" "hello" `shouldParse` PlainText "hello"
    it "parses bold text" $ parse text_ "" "*hello*" `shouldParse` BoldText "hello"

  context "parses nested \"MagicText\"" $ do
    it "on it's own with just one word inside" $
      parse text_ "" "^[hello]" `shouldParse` MagicText [PlainText "hello"]

    it "on it's own with bold text inside" $
      parse text_ "" "^[*hello*]" `shouldParse` MagicText [BoldText "hello"]

The last two test cases fail with the following errors:

~/sandbox > cabal run ParseBetween.hs

for basic one-word-at-a-time input
  parses plain text [✔]
  parses bold text [✔]
parses nested "MagicText"
  on it's own with just one word inside [✘]
  on it's own with bold text inside [✘]

Failures:

  /home/gideon/sandbox/ParseBetween.hs:43:33: 
  1) parses nested "MagicText" on it's own with just one word inside
       expected: MagicText [PlainText "hello"]
       but parsing failed with error:
         1:9:
           |
         1 | ^[hello]
           |         ^
         unexpected end of input
         expecting "^[", '*', ']', plain text, or white space

  To rerun use: --match "/parses nested \"MagicText\"/on it's own with just one word inside/" --seed 100639639

  /home/gideon/sandbox/ParseBetween.hs:46:35: 
  2) parses nested "MagicText" on it's own with bold text inside
       expected: MagicText [BoldText "hello"]
       but parsing failed with error:
         1:11:
           |
         1 | ^[*hello*]
           |           ^
         unexpected end of input
         expecting ']'

  To rerun use: --match "/parses nested \"MagicText\"/on it's own with bold text inside/" --seed 100639639

From the definition of manyTill_ I would expect it to match the ending ] first, and therefore not run into this "unexpected end-of-input" error, but I can't work out how to have this nested parsing behaviour in a way which works.


Solution

  • I can't see by inspection what's wrong with your bold-text example. But the problem with "[hello]" is simple enough. You start parsing MagicText, which consumes the [ and delegates to text_ again, planning to consume a ] afterwards. But the parser inside PlainText doesn't know it's supposed to leave behind a ] character. It happily consumes all the way to the end of the string, because it never encounters one of its stop characters, ' ' or '\n'. Then it completes, and the MagicText above it is upset it can't find its closing ].

    A common way to handle problems like this is to have a grammar with more explicit separations of its concepts, encoded in a hierarchy. A MagicText doesn't contain "any text, including magic, bold, or plain text": it includes "bold text or plain text". A BoldText doesn't contain "any text, including magic, bold, or plain text": it contains only plain text. And PlainText explicitly rejects characters that would be treated as delimiters/metacharacters for the levels above it. Roughly like this:

    text_ :: Parser Words
    text_ =
      choice
        [
          MagicText <$> between (string "^[") (char ']') (nonMagicText `sepBy1` space),
          nonMagicText
        ]
    
    nonMagicText = 
      choice
        [
          BoldText <$> between (char '*') (char '*') plainText,
          PlainText <$> plainText
        ]
    
    plainText = 
      takeWhile1P (Just "plaintext") (`notElem` "*^[] \n")