I am trying to parse a file that looks like:
a b c
f e d
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
x <- rowP
xs <- many (newline >> rowP)
return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser (Cell Color)
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.
You're right that manyTill
keeps parsing until a newline. But manyTill
never gets to see the newline because cellP
is too eager. cellP
ends up calling P.symbol
, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace
, which includes newlines. So P.symbol
is happily consuming the c
, followed by the space and the newline, and then manyTill
looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with that, the basic idea is that you want to modify the language's whiteSpace
field to correctly define whitespace to not be newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
in lexer0 { whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work for your specific case, but the idea is there. You want to change the definition of whiteSpace
to be whatever you want to define as whiteSpace
, not what the system defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace
was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.