I'm stuck on a problem with writing a parser in Haskell that I hope someone can help out with!
It is a bit more complicated than my usual parser because there are two layers of parsing. First a language definition is parsed into an AST, then that AST is transformed into another parser that parses the actual language.
I have made pretty good progress so far but I'm stuck on implementing recursion in the language definition. As the language definition is transformed from AST into a parser in a recursive function, I can't work out how it can call itself if it doesn't exist yet.
I'm finding it a bit hard to explain my problem, so maybe an example will help.
The language definition might define that a language consists of three keywords in sequence and then optional recursion in brackets.
A B C ($RECURSE)
Which would be parsed into an AST like:
[Keyword "A", Keyword "B", Keyword "C", Optional (Many [Recurse])]
The Many
is not really required for this example, but in my actual project, optional blocks can have multiple syntax elements in them so an Optional
would contain a Many
with n elements.
I would then want it to get transformed into a parser that parses strings like:
A B C
A B C (A B C)
A B C (A B C (A B C))
I've boiled down my project into the simplest possible example. You can see my TODO comment where I'm stuck trying to implement the recursion.
{-# LANGUAGE OverloadedStrings #-}
module Example
( runExample,
)
where
import Control.Applicative hiding (many, some)
import Data.Text (Text)
import Data.Void
import System.IO as SIO
import Text.Megaparsec hiding (State)
import Text.Megaparsec.Char (space1, string')
import qualified Text.Megaparsec.Char.Lexer as L
import Text.Megaparsec.Debug
import Text.Pretty.Simple (pPrint)
-- Types
type Parser = Parsec Void Text
data SyntaxAst = Keyword Text | Recurse | Optional SyntaxAst | Many [SyntaxAst]
-- Megaparsec Base Parsers
-- Space consumer - used by other parsers to ignore whitespace
sc :: Parser ()
sc =
L.space
space1
(L.skipLineComment "--")
(L.skipBlockComment "/*" "*/")
-- Runs a parser, then consumes any left over space with sc
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
-- Parses a string, then consumes any left over space with sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
-- Parses something between parentheses
inParens :: Parser a -> Parser a
inParens =
between
(symbol "(")
(symbol ")")
-- Transforms the AST into a parser
transformSyntaxExprToParser :: SyntaxAst -> Parser [Text]
transformSyntaxExprToParser (Many exprs) = dbg "Many" (createParser exprs)
transformSyntaxExprToParser (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text))
transformSyntaxExprToParser (Optional inner) = dbg "Optional" (option [] (try (inParens (transformSyntaxExprToParser inner))))
transformSyntaxExprToParser Recurse = dbg "Recurse" (pure ["TODO"]) -- TODO: How do I recurse here?
-- transformSyntaxExprToParser s Recurse = dbg "Recurse" (createParser s) -- Seems to work in the example, but in my actual application creates an infinite loop and freezes
-- Walks over the parser AST and convert it to a parser
createParser :: [SyntaxAst] -> Parser [Text]
createParser expressions =
do
foldr1 (liftA2 (<>)) (fmap transformSyntaxExprToParser expressions)
runExample :: IO ()
runExample = do
-- To make the example simple, lets cut out the language definition parsing and just define
-- it literally.
let languageParser = createParser [Keyword "A", Keyword "B", Keyword "C", Optional (Many [Recurse])]
let run p = runParser p "" "A B C (A B C (A B C))"
let result = run languageParser
case result of
Left bundle -> SIO.putStrLn (errorBundlePretty bundle)
Right xs -> pPrint xs
A few things I've tried:
transformSyntaxExprToParser
function and call createParser
when the Recurse
token is encountered. This didn't work due to infinite loops.I hope that makes sense, let me know if I need to elaborate more. I can also push up my full project if it will help.
Thanks for reading!
Edit: I've made changes to my original example to demonstrate the infinite loop problem (integrating the excellent suggestions in the answer below) at https://pastebin.com/DN0JJ9BA
I believe you can use laziness here. Pass the final parser as a parameter to transformSyntaxExprToParser
, and when you see a Recurse
, return that parser.
transformSyntaxExprToParser :: Parser [Text] -> SyntaxAst -> Parser [Text]
transformSyntaxExprToParser self = go
where
go (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text))
go (Optional inner) = dbg "Optional" (option [] (try (inParens (go inner))))
go Recurse = dbg "Recurse" self
createParser :: [SyntaxAst] -> Parser [Text]
createParser expressions = parser
where
parser = foldr1 (liftA2 (<>))
(fmap (transformSyntaxExprToParser parser) expressions)
This ought to produce exactly the same kind of recursive parser as if you had written it directly. A Parser
is ultimately just a data structure which you can construct using its instances of Monad
, Applicative
, Alternative
, &c.
Your idea of doing this with a mutable reference such as an IORef
is essentially what’s happening under the hood anyway when constructing and evaluating a thunk.
Your idea here was almost correct:
Pass the original AST up to the
transformSyntaxExprToParser
function and callcreateParser
when theRecurse
token is encountered. This didn't work due to infinite loops.
The problem is that you were constructing a new parser for every Recurse
, from the same input, which contains a Recurse
, thus constructing a new parser…and so on. What my code above does is just pass in the same parser.
If you need to perform monadic side effects while constructing the parser, such as logging, then you can use a recursive do
, for example, with some hypothetical MonadLog
class for illustration:
{-# Language RecursiveDo #-}
transformSyntaxExprToParser :: (MonadLog m) => Parser [Text] -> SyntaxAst -> m (Parser [Text])
transformSyntaxExprToParser self = go
where
go (Keyword text) = do
logMessage "Got ‘Keyword’"
pure $ dbg "Keyword" (pure <$> lexeme (string' text))
go (Optional inner) = do
logMessage "Got ‘Optional’"
inner' <- go inner
pure $ dbg "Optional" (option [] (try (inParens inner')))
go Recurse = do
logMessage "Got ‘Recurse’"
pure $ dbg "Recurse" self
createParser :: (MonadFix m, MonadLog m) => [SyntaxAst] -> m (Parser [Text])
createParser expressions = do
rec
parser <- fmap (foldr1 (liftA2 (<>)))
(traverse (transformSyntaxExprToParser parser) expressions)
pure parser
The rec
block introduces a recursive binding which you may construct using side effects. In general, some care is required to ensure that recursive definitions like this are sufficiently lazy, that is, you don’t force the result sooner than intended, but here the recursion pattern is very simple, and you never examine the self
parser, only treat it as a black box to hook up to other parsers.
This method also makes it explicit what the scope of a Recurse
is, and opens the possibility of introducing local recursive parsers, with a new call to transformSyntaxExprToParser
with a new local self
argument.