Correctly splitting by string with Parsec


The code below outputs Right ["1<!>2<!>3"], but I need Right ["1", "2", "3"].

import Text.ParserCombinators.Parsec

response = contents :: CharParser () [String]
  where
    contents = sepBy content contentDelimiter
    contentDelimiter = string "<!>"
    content = many anyChar

main = do
  putStrLn $ show $ parse response "Response" "1<!>2<!>3"

I suppose the problem here is that the content parser consumes all the input before sepBy gets to test the delimiter. So, my questions are:

  1. Am I correct with my assumption? If not, what is the mistake I've made?

  2. What solution would you recommend for such a problem? (Using Parsec)

* content has to match any string not containing the delimiter. The 1<!>2<!>3 is just an example; it could be dslkf\n><!>dsf<!>3 or whatever.


Solution

  • For your first example, you would replace

    content = many anyChar
    

    with

    content = many digit
    

    So that the parser of the content doesn't erroneously match the separator.
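
Applied to the whole program from the question, that one-line change looks like the sketch below. The type signatures and the use of `print` in place of `putStrLn $ show` are additions for readability; the parsers are otherwise the ones from the question.

```haskell
import Text.ParserCombinators.Parsec

-- Same structure as the program in the question; only `content` differs.
response :: CharParser () [String]
response = contents
  where
    contents         = sepBy content contentDelimiter
    contentDelimiter = string "<!>"
    -- Digits can never be part of "<!>", so `content` stops at the delimiter.
    content          = many digit

main :: IO ()
main = print (parse response "Response" "1<!>2<!>3")
-- prints: Right ["1","2","3"]
```

Because `many` accepts zero digits, an empty field also parses: the input `"1<!><!>3"` gives `Right ["1","","3"]`.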

    Maybe you want to match more than just digits, but even so, I advise you to think carefully about what is valid between <!>s and write a parser that matches exactly that.
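
If, as the question's footnote says, the content really is "any string not containing the delimiter", one way to state that directly is to take characters only while the delimiter does not start at the current position. This is a sketch rather than part of the original answer, and it assumes Parsec 3, where `notFollowedBy` accepts any parser whose result has a `Show` instance.

```haskell
import Text.ParserCombinators.Parsec

contentDelimiter :: CharParser () String
contentDelimiter = string "<!>"

-- Take one character at a time, but only while the delimiter does not
-- begin here. `notFollowedBy` consumes no input whether it succeeds or fails.
content :: CharParser () String
content = many (notFollowedBy contentDelimiter >> anyChar)

response :: CharParser () [String]
response = content `sepBy` contentDelimiter

main :: IO ()
main = do
  print (parse response "Response" "1<!>2<!>3")
  -- prints: Right ["1","2","3"]
  print (parse response "Response" "dslkf\n><!>dsf<!>3")
  -- prints: Right ["dslkf\n>","dsf","3"]
```

This splits on every `<!>`, including one that was meant to be part of a field, so the advice to be specific about what counts as content still applies.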

    Why?
    Once you've got a really good parser for content, your definition of response will work unchanged. Your content can then include something like mystring = "hello<!>mum" without being chopped up by the top-level parser: the low-level stringLiteral parser will eat the whole "hello<!>mum", so the top-level parser never sees the <!> that is correctly and innocently included inside it.
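
A minimal sketch of that situation follows. The `stringLiteral` here is a hypothetical, simplified parser (double-quoted, no escape sequences) written only for illustration; a real one would need to handle escapes.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical quoted-string parser: an opening double quote, any characters
-- other than a double quote, then a closing double quote. No escapes.
stringLiteral :: CharParser () String
stringLiteral = between (char '"') (char '"') (many (noneOf "\""))

-- Content is either a quoted string or a (possibly empty) run of digits.
content :: CharParser () String
content = stringLiteral <|> many digit

response :: CharParser () [String]
response = content `sepBy` string "<!>"

main :: IO ()
main = print (parse response "Response" "1<!>\"hello<!>mum\"<!>3")
-- prints: Right ["1","hello<!>mum","3"]
```

The `<!>` inside the quotes is consumed by `stringLiteral`, so `sepBy` only ever tries the delimiter at positions between fields.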

    Generally,...
    In most parsing situations it's best to be really clear what's allowed in your content, and parse only that, for three reasons:

    • reusability (you can then use this within a larger parser)
    • correctness
    • usually efficiency - if you avoid too much lookahead, your parser runs faster.

    Reusability is important. At the moment, if you use a parser that just splits on <!> and eats everything else, it's guaranteed to eat the whole input, and you won't be able to do any more parsing.
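
To illustrate the reusability point: because a digits-only `content` stops at the first character it does not recognise, `response` can be embedded in a larger grammar. The sketch below assumes a hypothetical format with one response per line; `document` is an illustrative name, not something from the question.

```haskell
import Text.ParserCombinators.Parsec

response :: CharParser () [String]
response = many digit `sepBy` string "<!>"

-- Hypothetical larger parser: each response is terminated by a newline.
document :: CharParser () [[String]]
document = response `endBy` newline

main :: IO ()
main = print (parse document "Document" "1<!>2\n3<!>4\n")
-- prints: Right [["1","2"],["3","4"]]
```

With `content = many anyChar`, the first `content` would swallow the newlines and everything after them, leaving nothing for `newline` to match, so `document` would fail.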

    Bottom-up
    Your parsers should work from the ground up - you described this very well in your comment as "stacking the parsers from specific to general".

    It's easiest to write them in that order because each one can be tested as soon as it exists. So first write one that matches a stringChar, then stringLiteral, then member, array, object, json, content and finally response. You can have them calling each other recursively along the way. You can then use parseTest to test each little one as you go along; typing parseTest response "1<!>2<!>3" into ghci is quicker than rewriting main and compiling.
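
For instance, if the digits-only `content` and the delimiter are given top-level names, each layer can be checked on its own. `parseTest` prints the result with `show`, or the parse error if there is one. The calls are placed in `main` here only so the block is self-contained; the same calls can be typed at the ghci prompt.

```haskell
import Text.ParserCombinators.Parsec

-- Smallest pieces first.
content :: CharParser () String
content = many digit

contentDelimiter :: CharParser () String
contentDelimiter = string "<!>"

-- Then the parser built from them.
response :: CharParser () [String]
response = content `sepBy` contentDelimiter

main :: IO ()
main = do
  parseTest content "42<!>7"           -- prints: "42"
  parseTest contentDelimiter "<!>7"    -- prints: "<!>"
  parseTest response "1<!>2<!>3"       -- prints: ["1","2","3"]
```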

    Top-down?
    It's not wrong to write your parser top-down, just harder. You can write

    response = content `sepBy` contentSeparator
    content = json <|> somethingElse
    json = object <|> array
    array = ...
    

    but nothing is testable until you've written the very smallest parser.