haskell monads text-parsing lexer megaparsec

Mixing Parser Char (lexer?) vs. Parser String


I've written several compilers and am familiar with lexers, regexes/NFAs/DFAs, parsers and semantic rules in flex/bison, JavaCC, JavaCup, ANTLR4 and so on.

Is there some sort of magical monadic operator that seamlessly grows/combines a token from a mix of Parser Char (i.e. Text.Megaparsec.Char) and Parser String parsers?

Is there a way, or a best practice, to keep a clean separation between lexing tokens and nonterminal expectations?


Solution

  • Typically, one uses applicative operations to combine Parser Char and Parser String parsers directly, rather than "upgrading" the former. For example, a parser for alphanumeric identifiers that must start with a letter would probably look like:

    ident :: Parser String
    ident = (:) <$> letterChar <*> many alphaNumChar
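
    The snippets in this answer leave out their imports and the Parser type. A minimal setup for trying them, assuming String input and no custom error component (the Parser alias below is an assumption, not something given in the question), might look like this. Note that the tail of the identifier needs many alphaNumChar so that it has type Parser String, which is what (:) expects as its second argument.

    ```haskell
    import Data.Void (Void)
    import Text.Megaparsec (Parsec, many, parseMaybe)
    import Text.Megaparsec.Char (alphaNumChar, letterChar)

    -- Assumed parser type: no custom error component, String input.
    type Parser = Parsec Void String

    -- One letter, followed by zero or more letters or digits.
    ident :: Parser String
    ident = (:) <$> letterChar <*> many alphaNumChar
    ```

    In GHCi, parseMaybe ident "x1" gives Just "x1", while parseMaybe ident "1x" gives Nothing, because parseMaybe requires the parser to consume the whole input.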
    

    For something more complicated, such as parsing dollar amounts with optional cents, you might write:

    dollars :: Parser String
    dollars = (:) <$> char '$' <*> some digitChar
              <**> pure (++)
              <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
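
    The <**> pure (++) step turns the string parsed so far into a function waiting for a second string, so that the optional cents can be appended with <*>. If that reads as too dense, the same parser can be sketched in do-notation. The name dollarsDo and the Parser alias are illustrative assumptions; the behavior is intended to match dollars above.

    ```haskell
    import Control.Monad (replicateM)
    import Data.Void (Void)
    import Text.Megaparsec (Parsec, option, parseMaybe, some)
    import Text.Megaparsec.Char (char, digitChar)

    -- Assumed parser type: no custom error component, String input.
    type Parser = Parsec Void String

    -- Hypothetical do-notation counterpart of dollars.
    dollarsDo :: Parser String
    dollarsDo = do
      sign  <- char '$'                -- the leading dollar sign
      whole <- some digitChar          -- one or more digits
      -- Optional ".dd"; yields "" when no decimal point follows.
      cents <- option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
      pure (sign : whole ++ cents)
    ```

    With this definition, parseMaybe dollarsDo "$12.34" gives Just "$12.34" and parseMaybe dollarsDo "$12" gives Just "$12".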
    

    If you often find yourself building a Parser String out of a complicated sequence of Parser Char and Parser String parsers, you could define a few helper operators like the ones below. If the variety of operators gets annoying, you could instead define only (<++>) plus a short alias for a charToStr-style lift (fmap (:[])), such as c :: Parser Char -> Parser String.

    (<.+>) :: Parser Char -> Parser String -> Parser String
    p <.+> q = (:) <$> p <*> q
    infixr 5 <.+>
    
    (<++>) :: Parser String -> Parser String -> Parser String
    p <++> q = (++) <$> p <*> q
    infixr 5 <++>
    
    (<..>) :: Parser Char -> Parser Char -> Parser String
    p <..> q = p <.+> fmap (:[]) q
    infixr 5 <..>
    

    so you can write something like:

    dollars' :: Parser String
    dollars' = char '$' <.+> some digitChar 
               <++> option "" (char '.' <.+> digitChar <..> digitChar)
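
    The single-operator alternative mentioned above, with only (<++>) and a short lift c, could be sketched as follows. The name dollars'' and the Parser alias are assumptions made for illustration.

    ```haskell
    import Data.Void (Void)
    import Text.Megaparsec (Parsec, option, parseMaybe, some)
    import Text.Megaparsec.Char (char, digitChar)

    -- Assumed parser type: no custom error component, String input.
    type Parser = Parsec Void String

    -- Run two string parsers in sequence and concatenate their results.
    (<++>) :: Parser String -> Parser String -> Parser String
    p <++> q = (++) <$> p <*> q
    infixr 5 <++>

    -- Lift a single-character parser to a one-character string parser.
    c :: Parser Char -> Parser String
    c = fmap (:[])

    -- Hypothetical variant of dollars' that uses only (<++>) and c.
    dollars'' :: Parser String
    dollars'' = c (char '$') <++> some digitChar
                <++> option "" (c (char '.') <++> c digitChar <++> c digitChar)
    ```

    The trade-off is a little more noise at each character parser in exchange for having a single combining operator to remember.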
    

    As @leftroundabout says, there's nothing hackish about fmap (:[]). If you think it reads more clearly, write fmap (\c -> [c]) instead.
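
    On separating tokens from nonterminals: megaparsec has no separate lexer pass, but one common convention is to wrap every token-level parser with lexeme and symbol from Text.Megaparsec.Char.Lexer, so that whitespace handling lives in one place and grammar rules are written purely in terms of tokens. The sketch below is one possible arrangement, not the only one; the names sc, identifier, Assign and assignment are illustrative, and it assumes no comment syntax.

    ```haskell
    import Control.Applicative (empty)
    import Data.Void (Void)
    import Text.Megaparsec (Parsec, many, parseMaybe)
    import Text.Megaparsec.Char (alphaNumChar, letterChar, space1)
    import qualified Text.Megaparsec.Char.Lexer as L

    -- Assumed parser type: no custom error component, String input.
    type Parser = Parsec Void String

    -- Whitespace consumer; `empty` disables line and block comments.
    sc :: Parser ()
    sc = L.space space1 empty empty

    -- Token wrappers: parse a token, then skip trailing whitespace.
    lexeme :: Parser a -> Parser a
    lexeme = L.lexeme sc

    symbol :: String -> Parser String
    symbol = L.symbol sc

    -- "Lexer" level: an identifier token.
    identifier :: Parser String
    identifier = lexeme ((:) <$> letterChar <*> many alphaNumChar)

    -- "Grammar" level: a hypothetical rule of the form `name = name`.
    data Assign = Assign String String deriving (Eq, Show)

    assignment :: Parser Assign
    assignment = Assign <$> identifier <* symbol "=" <*> identifier
    ```

    Here parseMaybe assignment "x = y1" gives Just (Assign "x" "y1"). Only identifier and symbol deal with characters and whitespace; assignment never mentions either.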