parsing haskell parsec megaparsec

Grouping lines with Parsec


I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key-value pair separated by a colon, or it is a URL that is described by the preceding tags.

Here's a short example:

#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net

For simplicity's sake, I'm going to store everything as String. A tag is a pair, type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group each URL with the tags that precede it, as ([Tag], URL).
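To make the goal concrete, the sample above would ideally come out as the value below. This is only a sketch of the target shape: the `URL` alias is assumed to be a plain String, and `Entry` and `expected` are names introduced here for illustration.

```haskell
type Tag = (String, String)

-- Assumption: a URL is stored as a plain String.
type URL = String

-- Hypothetical name for one group: the tags that precede a URL, plus the URL.
type Entry = ([Tag], URL)

-- The grouping the sample input should produce.
expected :: [Entry]
expected =
  [ ([("foo", "bar"), ("faz", "baz")], "https://example.com")
  , ([("foo", "beep")], "https://example.net")
  ]
```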

However, I'm struggling to figure out how to parse either [one or more tags] or [one URL].

My current approach looks like this:

import qualified System.Environment   as Env
import qualified Text.Megaparsec      as M
import qualified Text.Megaparsec.Text as M

type Tag = (String, String)

data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)

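-- A tag line: '#', a key up to ':', then a value up to the end of the line.
tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> key <*> value) M.<?> "Tag starting with #"
  where
    key   = M.someTill M.printChar (M.char ':')
    value = M.someTill M.printChar M.eol

-- A URL line: every printable character up to the end of the line.
urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"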

parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)

main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res

If I try to run this on the above sample, I get a parsing error like this:

3:1:
unexpected 'h'
expecting Tag starting with # or end of input

Clearly my use of many in combination with <|> is incorrect. Since the tag parser doesn't consume any input that belongs to a URL line, this can't be a backtracking problem. How do I change this to get the desired result?

The full example is available on GitHub.


† I'm actually using Megaparsec here for better error messages, but I think the problem is quite generic and not specific to any particular implementation of parser combinators.


Solution

  • @cocreature answered this for me on Twitter.

    As leftaroundabout pointed out here, there are two separate mistakes in my code:

    1. The parser itself misuses <|>. many tagP succeeds even when it matches zero tags, without consuming any input, so the URL alternative is never tried. Instead, the parser should run the two in sequence: zero or more tag lines followed by one URL line.
    2. The invocation (parseFromFile) applies the parser only once, so even a corrected parser would stop after the first block and fail at the start of the second.

    We can fix the parser and introduce grouping in one go:

    -- liftA2 comes from Control.Applicative
    parser :: M.Parser ([Tag], String)
    parser = liftA2 (,) (M.many tagP) urlP

    Afterwards, we just need to apply the change suggested by leftaroundabout:

    ...
    res <- M.parseFromFile (M.many parser <* M.eof) fname
    

    Running this leads to the desired result:

    [([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
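For readers on a newer library version, the two mistakes and the fix can be reproduced without a file. The following is a minimal sketch, not the original program: it assumes Megaparsec 7 or later (where Text.Megaparsec.Text and parseFromFile no longer exist and the Parser type is defined by hand), it parses from a String, and the names `brokenP`, `blockP`, `sample`, `brokenOnUrl` and `grouped` are introduced here for illustration.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, many, parse, parseMaybe, someTill, (<|>))
import Text.Megaparsec.Char (char, eol, printChar)

-- Assumption: Megaparsec 7+, no custom error component, String input.
type Parser = Parsec Void String

type Tag = (String, String)

data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)

-- A tag line: '#', a key up to ':', then a value up to the end of the line.
tagP :: Parser Tag
tagP = char '#' *> ((,) <$> key <*> value)
  where
    key   = someTill printChar (char ':')
    value = someTill printChar eol

-- A URL line: every printable character up to the end of the line.
urlP :: Parser String
urlP = someTill printChar eol

-- The parser from the question. `many tagP` succeeds on zero tags without
-- consuming input, so the right-hand side of <|> is never tried.
brokenP :: Parser Segment
brokenP = (Tags <$> many tagP) <|> (URL <$> urlP)

-- The fix: zero or more tag lines followed by exactly one URL line.
blockP :: Parser ([Tag], String)
blockP = (,) <$> many tagP <*> urlP

-- The sample input; unlines adds the trailing newline that eol needs.
sample :: String
sample = unlines
  [ "#foo:bar", "#faz:baz", "https://example.com"
  , "#foo:beep", "https://example.net" ]

-- `parse` does not demand end of input, so this exposes what brokenP
-- returns when it is handed a URL line.
brokenOnUrl :: Maybe Segment
brokenOnUrl =
  either (const Nothing) Just (parse brokenP "" "https://example.com\n")

-- `parseMaybe` requires the whole input to be consumed; `many blockP`
-- applies the block parser repeatedly instead of just once.
grouped :: Maybe [([Tag], String)]
grouped = parseMaybe (many blockP) sample
```

Here `brokenOnUrl` evaluates to `Just (Tags [])`: the broken parser "succeeds" with an empty tag list and leaves the URL unread, which is why the following eof check fails. `grouped` evaluates to `Just` of the same list shown in the output above.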