Search code examples
parsinghaskellmegaparsec

Parse many terms not followed by symbol


I am attempting to develop a lambda calculus interpreter, which supports definitions of terms. For that, I'm using the Megaparsec library in Haskell.

I have the following grammar:

term := var | lambda var . term | term term
def := var = term
instruction := :e t | def

My parser should return a list of instructions, where :e t will evaluate to whatever t is reduced, and def will allow me to have named definitions.

So far, I've managed to parse a very similar grammar, except I'm using semicolons to separate each instruction:

...
instruction := :e t; | def;

I'd like to remove the need for those semicolons, but I'm having no luck with that. My code is currently:

sc :: Parser ()
sc = L.space space1 (L.skipLineComment "--") (L.skipBlockComment "{-" "-}")

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

semicolon :: Parser Text
semicolon = symbol ";"

lowercaseWord :: Parser [Char]
lowercaseWord = some lowerChar

variable :: Parser Term
variable = lexeme lowercaseWord <&> Var

lamAbs :: Parser Term
lamAbs = do
  symbol "lambda"
  space
  v <- lexeme lowercaseWord
  space
  symbol "."
  space
  t <- term
  space
  return $ Abs v t

term :: Parser Term
term = foldl1 App <$> sepBy1 (parens term <|> try lamAbs <|> variable) sc
  where parens = between (symbol "(") (symbol ")")

definition :: Parser (Instruction Term)
definition = do
  n <- lexeme lowercaseWord
  space
  symbol "="
  space
  t <- term
  return $ IDef n t

commandEval :: Parser (Instruction Term)
commandEval = do
  symbol ":e"
  t <- term
  optional newline
  return $ IEval t

program :: Parser [Instruction Term]
program = sc *> sepEndBy (definition <|> commandEval) semicolon <* eof

parse :: ProgramParser (Instruction Term)
parse text = case runParser program "" text of
  Left err  -> Left $ errorBundlePretty err
  Right ast -> Right ast

When I attempt to remove the need for semicolons by changing the definition of program to:

program = sc *> some (definition <|> commandEval) <* eof

the parser will not be able to figure out that it needs to stop parsing a term once it finds something like var = ..., to continue parsing the next definition.

For example, the following input:

a = (lambda x . x)
    x

:e b a

b = lambda y . y

returns the following error:

6:3:
  |
6 | b = lambda y . y
  |   ^
unexpected '='
expecting ":e", "lambda", '(', end of input, lowercase letter, or newline

I tried to change the definition of term to make it explicit that it should not be followed by a lowercase word and a =. But that didn't work either:

term = foldl1 App <$> sepBy1 ((parens term <|> try lamAbs <|> variable) <* notFollowedBy newdef) sc
  where parens = between (symbol "(") (symbol ")")
        newdef = (lexeme lowercaseWord *> space) *> symbol "="

How can I make it so that I don't have the need for semicolons, and my parser automatically stops parsing a term application once it finds var = term?

Thank you in advance for your help.


Solution

  • As noted by @Poselsky, the point of using Text.Megaparsec.Char.Lexer is that you don't need (and shouldn't have) to deal with white space in your higher-level parsers. So, I'd start by deleting every use of space in lamAbs and definition, deleting the optional newline from commandEval and replacing sepBy1 xxx sc with some xxx in term. (In all these cases, these parsers perform no function. They are run on the input stream after the previous lexeme parser has already absorbed all trailing whitespace.)

    I'd also move the lexeme call into lowercaseWord:

    lowercaseWord :: Parser [Char]
    lowercaseWord = lexeme (some lowerChar)
    

    Then, except for your primitive parsers (now, symbol and lowercaseWord), the only place you should need to deal with whitespace is the sc at the start of your top-level parser program. Every other parser should be assumed to start at non-whitespace and absorb trailing whitespace, and those parsers can be combined in sequence to create another parser that follows the same rule, so no stray space or sc or newline calls are needed.

    That will simplify your code, though it won't fix your problem.

    Your attempted fix is the right idea. The problem is that it can't parse the last variable before a newdef because that variable is followed by a newdef.

    You don't actually need to try to parse a whole definition. All you really need to do is make sure a variable isn't followed by a symbol "=", and you need to wrap it in a try block to backtrack on the name parsed by variable:

    term :: Parser Term
    term = foldl1 App <$> some (parens term <|> try lamAbs <|>
                                try (variable <* notFollowedBy (symbol "=")))
      where parens = between (symbol "(") (symbol ")")
    

    Do be warned that there are still some problems with your parser. It accepts the following input, for example, which you might not expect:

    x = lambdalicious. man
    

    (Note that this has nothing to do with removing the space calls -- your original parser parsed this, too.)

    The problem is that symbol "lambda" can match the prefix of a larger word, so "lambdalicious" is parsed as "lambda licious". There's a similar issue with symbol ":e", if you don't want ":eval" to parse as ":e val".

    The usual solution is to define:

    reserved :: Text -> Parser Text
    reserved s = lexeme $ try (string s <* notFollowedBy lowerChar)
    

    and then the following work as intended:

    reserved "lambda"
    reserved ":e"
    

    Here's my full test program with all the above revisions:

    {-# LANGUAGE OverloadedStrings #-}
    
    module LambdaParser where
    
    import Data.Void
    
    import Data.Functor
    import Data.Text (Text)
    import Text.Megaparsec hiding (parse)
    import Text.Megaparsec.Char
    import qualified Text.Megaparsec.Char.Lexer as L
    
    type Parser = Parsec Void Text
    
    data Term = Var String | Instruction Term | Abs String Term | App Term Term deriving (Show)
    data Instruction t = IDef String t | IEval Term deriving (Show)
    
    sc :: Parser ()
    sc = L.space space1 (L.skipLineComment "--") (L.skipBlockComment "{-" "-}")
    
    lexeme :: Parser a -> Parser a
    lexeme = L.lexeme sc
    
    symbol :: Text -> Parser Text
    symbol = L.symbol sc
    
    reserved :: Text -> Parser Text
    reserved s = lexeme $ try (string s <* notFollowedBy lowerChar)
    
    semicolon :: Parser Text
    semicolon = symbol ";"
    
    lowercaseWord :: Parser [Char]
    lowercaseWord = lexeme (some lowerChar)
    
    variable :: Parser Term
    variable = lowercaseWord <&> Var
    
    lamAbs :: Parser Term
    lamAbs = do
      reserved "lambda"
      v <- lowercaseWord
      symbol "."
      t <- term
      return $ Abs v t
    
    term :: Parser Term
    term = foldl1 App <$> some (parens term <|> try lamAbs <|>
                                try (variable <* notFollowedBy (symbol "=")))
      where parens = between (symbol "(") (symbol ")")
    
    definition :: Parser (Instruction Term)
    definition = do
      n <- lowercaseWord
      symbol "="
      t <- term
      return $ IDef n t
    
    commandEval :: Parser (Instruction Term)
    commandEval = do
      reserved ":e"
      t <- term
      return $ IEval t
    
    program :: Parser [Instruction Term]
    program = sc *> some (definition <|> commandEval) <* eof
    
    parse :: Text -> Either String [Instruction Term]
    parse text = case runParser program "" text of
      Left err  -> Left $ errorBundlePretty err
      Right ast -> Right ast
    
    main = do
      let ex1 = "a = (lambda x . x)\n   x\n\n:e b a\n\nb = lambda y . y"
      case parse ex1 of
        Left err -> putStrLn err
        Right pgm -> print pgm