I've written several compilers and am familiar with lexers, regexes/NFAs/DFAs, parsers and semantic rules in flex/bison, JavaCC, JavaCup, antlr4 and so on.
Is there some sort of magical monadic operator that seamlessly grows/combines a token from a mix of Parser Char (i.e. the parsers in Text.Megaparsec.Char) and Parser String?
Is there a way, or a best practice, to represent a clean separation between lexing tokens and nonterminal expectations?
Typically, one uses applicative operations to combine Parser Char and Parser String parsers directly, rather than "upgrading" the former. For example, a parser for alphanumeric identifiers that must start with a letter would probably look like:
ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar
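To try this out, the snippet needs a concrete Parser type, which the answer leaves implicit. A minimal sketch follows, assuming the common alias type Parser = Parsec Void String (no custom error component, String input); that alias is an assumption, not something stated above.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, many, parseMaybe)
import Text.Megaparsec.Char (alphaNumChar, letterChar)

-- Assumed parser type: no custom errors, String input.
type Parser = Parsec Void String

-- One letter followed by zero or more alphanumeric characters.
-- (:) conses the leading Char onto the String produced by `many`.
ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar
```

In GHCi, parseMaybe ident "x1y2" gives Just "x1y2", while parseMaybe ident "1abc" gives Nothing, because the first character is not a letter.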
For something more complicated, such as parsing dollar amounts with optional cents, you might write:
-- requires: import Control.Applicative ((<**>)) and import Control.Monad (replicateM)
dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
  <**> pure (++)
  <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
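The <**> pure (++) step can look cryptic: it takes the String parsed so far and applies (++) to it, yielding a function that is then applied to the optional cents. As a sketch of an equivalent spelling, the two halves can be named and joined with (++) <$> ... <*> .... The names whole, cents and dollarsAlt are made up for this illustration, and the Parser alias is assumed to be Parsec Void String.

```haskell
import Control.Monad (replicateM)
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, parseMaybe, some)
import Text.Megaparsec.Char (char, digitChar)

-- Assumed parser type: no custom errors, String input.
type Parser = Parsec Void String

-- Whole-dollar part, e.g. "$12" (hypothetical helper name).
whole :: Parser String
whole = (:) <$> char '$' <*> some digitChar

-- Optional cents, e.g. ".50"; yields "" when no '.' follows
-- (hypothetical helper name).
cents :: Parser String
cents = option "" ((:) <$> char '.' <*> replicateM 2 digitChar)

-- Run both parsers in sequence and concatenate their results.
dollarsAlt :: Parser String
dollarsAlt = (++) <$> whole <*> cents
```

With these definitions, parseMaybe dollarsAlt "$12.50" gives Just "$12.50" and parseMaybe dollarsAlt "$12" gives Just "$12", while an input with only one digit after the point, such as "$12.5", is rejected.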
If you find yourself building a Parser String out of a complicated sequence of Parser Char and Parser String parsers in a lot of situations, you could define a few helper operators like the ones below. (If you find the variety of operators annoying, you could define just (<++>) plus a short form of charToStr, such as c :: Parser Char -> Parser String.)
(<.+>) :: Parser Char -> Parser String -> Parser String
p <.+> q = (:) <$> p <*> q
infixr 5 <.+>
(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>
(<..>) :: Parser Char -> Parser Char -> Parser String
p <..> q = p <.+> fmap (:[]) q
infixr 5 <..>
so you can write something like:
dollars' :: Parser String
dollars' = char '$' <.+> some digitChar
  <++> option "" (char '.' <.+> digitChar <..> digitChar)
As @leftroundabout says, there's nothing hackish about fmap (:[]). If you think it looks clearer, write fmap (\c -> [c]) instead.
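For comparison, the single-operator variant mentioned earlier (just (<++>) plus a short lifting function c) might look like the sketch below. The name dollarsC is made up for this illustration, and the Parser alias is again assumed to be Parsec Void String.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, parseMaybe, some)
import Text.Megaparsec.Char (char, digitChar)

-- Assumed parser type: no custom errors, String input.
type Parser = Parsec Void String

-- Run two string parsers in sequence and concatenate their results.
(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>

-- Lift a single-character parser to a one-character string parser.
c :: Parser Char -> Parser String
c = fmap (:[])

-- Same grammar as dollars', written with only (<++>) and c
-- (hypothetical name).
dollarsC :: Parser String
dollarsC = c (char '$') <++> some digitChar
  <++> option "" (c (char '.') <++> c digitChar <++> c digitChar)
```

The trade-off is one operator to remember at the cost of wrapping every character parser in c; parseMaybe dollarsC "$3.07" gives Just "$3.07".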