Why does Parsec's sepBy stop without parsing all elements?


I am trying to parse a comma-separated string whose elements may or may not be image dimensions. For example "hello world, 300x300, good bye world".

I've written the following little program:

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Parsec
import qualified Text.Parsec.Text as PS

parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
                      Left _ -> [Nothing]
                      Right dimens -> dimens

dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')

dimensParser :: PS.Parser (Int, Int)
dimensParser = do
  w <- many1 digit
  char 'x'
  h <- many1 digit
  return (read w, read h)

main :: IO ()
main = do
  print $ parseTestString "300x300,40x40,5x5"
  print $ parseTestString "300x300,hello,5x5,6x6"

According to the optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:

[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing,Just (5,5),Just (6,6)]

but instead I get:

[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]

I.e. parsing stops after the first failure. So I have two questions:

  1. Why does it behave this way?
  2. How do I write a correct parser for this case?

Solution

  • In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.

    We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:

      w <- many1 digit   -- yes, "300"
      char 'x'           -- yes, "x"
      h <- many1 digit   -- yes, "300"
      return (read w, read h) -- never fails
    

    We've successfully parsed the first dimension. The next character is a comma, so sepBy consumes that separator as well. Next, we try to parse "hello" and fail:

     w <- many1 digit -- no. 'h' is not a digit. Stop
    

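    dimensParser fails here without consuming any input, so optionMaybe succeeds with Nothing and leaves "hello" where it is. Next, sepBy tries to parse a comma, but that's not possible, since the next character is 'h', not ','. Therefore, sepBy stops and returns what it has collected so far.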
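
    If you'd rather see this than trace it on paper, getInput shows what is left over once sepBy gives up. A small sketch; withRest and leftover are names made up for this illustration, not part of the question's code:

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import Text.Parsec
    import qualified Text.Parsec.Text as PS

    dimensParser :: PS.Parser (Int, Int)
    dimensParser = do
      w <- many1 digit
      _ <- char 'x'
      h <- many1 digit
      return (read w, read h)

    -- Hypothetical helper: run the question's parser, then also return
    -- whatever input it did not consume.
    withRest :: PS.Parser ([Maybe (Int, Int)], Text)
    withRest = do
      dims <- optionMaybe dimensParser `sepBy` char ','
      rest <- getInput
      return (dims, rest)

    -- Hypothetical wrapper that hides the ParseError.
    leftover :: Text -> Maybe ([Maybe (Int, Int)], Text)
    leftover = either (const Nothing) Just . parse withRest ""

    main :: IO ()
    main = print (leftover "300x300,hello,5x5,6x6")
    -- prints: Just ([Just (300,300),Nothing],"hello,5x5,6x6")
    ```

    The Nothing in the result is optionMaybe succeeding on zero characters; the "hello" it could not parse is still sitting at the front of the remaining input.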

    We haven't parsed all the input, but parse doesn't require that: it succeeds as soon as the parser itself succeeds, no matter how much input is left. You would have gotten a proper error message if you had used

    parse (dimensStringParser <* eof)
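
    As a rough sketch of what that looks like in a full program (parseStrict and parsesFully are hypothetical names introduced here; the parsers are the ones from the question):

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import Text.Parsec
    import qualified Text.Parsec.Text as PS

    dimensParser :: PS.Parser (Int, Int)
    dimensParser = do
      w <- many1 digit
      _ <- char 'x'
      h <- many1 digit
      return (read w, read h)

    -- The parser from the question, unchanged.
    dimensStringParser :: PS.Parser [Maybe (Int, Int)]
    dimensStringParser = optionMaybe dimensParser `sepBy` char ','

    -- Hypothetical helper: succeed only if the whole input is consumed.
    parseStrict :: Text -> Either ParseError [Maybe (Int, Int)]
    parseStrict = parse (dimensStringParser <* eof) ""

    -- Hypothetical helper: True when the strict parse succeeds.
    parsesFully :: Text -> Bool
    parsesFully = either (const False) (const True) . parseStrict

    main :: IO ()
    main = do
      print (parseStrict "300x300,40x40,5x5")      -- a Right value
      print (parseStrict "300x300,hello,5x5,6x6")  -- a Left, reported at the 'h'
    ```

    Note that the question's parseTestString maps every Left to [Nothing], so you'd need to print the ParseError itself to actually see the message.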
    

    Either way, if you want to skip over anything in the list that's not a dimension (and get Nothing for it), you can use

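    dimensStringParser1 :: PS.Parser (Maybe (Int, Int))
    -- try makes a partial match such as "300x" backtrack, so the second branch can still run.
    -- The second branch consumes everything up to the next comma and yields Nothing.
    dimensStringParser1 = (Just <$> try dimensParser) <|> (skipMany (noneOf ",") >> return Nothing)
    
    dimensStringParser = dimensStringParser1 `sepBy` char ','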
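
    Put together as a complete program, the idea might look like the sketch below. It wraps dimensParser in try so that a partial match like "300x" backtracks, produces Nothing with return, and adds eof so leftover input is reported as a failure. parseDims is a hypothetical name, and returning [] on a parse error is just a choice made for this example:

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import Text.Parsec
    import qualified Text.Parsec.Text as PS

    dimensParser :: PS.Parser (Int, Int)
    dimensParser = do
      w <- many1 digit
      _ <- char 'x'
      h <- many1 digit
      return (read w, read h)

    -- One list element: a dimension, or anything else up to the next comma.
    dimensStringParser1 :: PS.Parser (Maybe (Int, Int))
    dimensStringParser1 =
          (Just <$> try dimensParser)
      <|> (skipMany (noneOf ",") >> return Nothing)

    dimensStringParser :: PS.Parser [Maybe (Int, Int)]
    dimensStringParser = dimensStringParser1 `sepBy` char ','

    -- Hypothetical wrapper: an empty list stands in for a parse error.
    parseDims :: Text -> [Maybe (Int, Int)]
    parseDims = either (const []) id . parse (dimensStringParser <* eof) ""

    main :: IO ()
    main = do
      print (parseDims "300x300,40x40,5x5")
      -- prints: [Just (300,300),Just (40,40),Just (5,5)]
      print (parseDims "300x300,hello,5x5,6x6")
      -- prints: [Just (300,300),Nothing,Just (5,5),Just (6,6)]
    ```

    One thing this sketch does not handle is whitespace around the commas: in an input like "hello world, 300x300" the leading space makes dimensParser fail, so that element comes out as Nothing.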