Search code examples
parsinghaskelloperator-precedenceparsecmegaparsec

Correctly parsing nested data using megaparsec


I am trying to get more familiar with megaparsec, and I am running into some issues with presedences. By 'nested data' in the title I refer to the fact that I am trying to parse types, which in turn could contain other types. If someone could explain why this does not behave as I would expect, please don't hesitate to tell me.

I am trying to parse types similar to those found in Haskell. Types are either base types Int, Bool, Float or type variables a (any lowercase word). We can also construct algebraic data types from type constructors (Uppercase words) such as Maybe and type parameters (any other type). Examples are Maybe a and Either (Maybe Int) Bool. Functions associate to the right and are constructed with ->, such as Maybe a -> Either Int (b -> c). N-ary tuples are a sequence of types separated by , and enclosed in ( and ), such as (Int, Bool, a). A type can be wrapped in parenthesis to raise its precedence level (Maybe a). A unit type () is also defined.

I am using this ADT to describe this.

newtype Ident  = Ident String
newtype UIdent = UIdent String
data Type a
    = TLam a (Type a) (Type a)
    | TVar a Ident
    | TNil a
    | TAdt a UIdent [Type a]
    | TTup a [Type a]
    | TBool a
    | TInt a
    | TFloat a

I have tried to write a megaparsec parser to parse such types, but I get unexpected results. I attach the relevant code below after which I will try to describe what I experience.

{-# LANGUAGE OverloadedStrings #-}
module Parser where

import AbsTinyCamiot

import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer
import Text.Megaparsec.Debug

import Control.Applicative hiding (many, some, Const)
import Control.Monad.Combinators.Expr
import Control.Monad.Identity
import Data.Void
import Data.Text (Text, unpack)

type Parser a = ParsecT Void Text Identity a

-- parse types

pBaseType :: Parser (Type ())
pBaseType = choice [
    TInt   () <$  label "parse int"           (pSymbol "Int"),
    TBool  () <$  label "parse bool"          (pSymbol "Bool"),
    TFloat () <$  label "parse float"         (pSymbol "Float"),
    TNil   () <$  label "parse void"          (pSymbol "()"),
    TVar   () <$> label "parse type variable" pIdent]

pAdt :: Parser (Type ())
pAdt = label "parse ADT" $ do
    con <- pUIdent
    variables <- many $ try $ many spaceChar >> pType
    return $ TAdt () con variables

pType :: Parser (Type ())
pType = label "parse a type" $ 
        makeExprParser 
        (choice [ try pFunctionType
                , try $ parens pType
                , try pTupleType
                , try pBaseType
                , try pAdt
                ]) 
        []--[[InfixR (TLam () <$ pSymbol "->")]]

pTupleType :: Parser (Type ())
pTupleType = label "parse a tuple type" $ do
    pSymbol "("
    fst  <- pType
    rest <- some (pSymbol "," >> pType)
    pSymbol ")"
    return $ TTup () (fst : rest)

pFunctionType :: Parser (Type ())
pFunctionType = label "parse a function type" $ do
    domain <- pType
    some spaceChar
    pSymbol "->"
    some spaceChar
    codomain <- pType
    return $ TLam () domain codomain

parens :: Parser a -> Parser a
parens p = label "parse a type wrapped in parentheses" $ do
    pSymbol "("
    a <- p
    pSymbol ")"
    return a

pUIdent :: Parser UIdent
pUIdent = label "parse a UIdent" $ do
    a <- upperChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    return $ UIdent (a:rest)

pIdent :: Parser Ident
pIdent = label "parse an Ident" $ do
    a <- lowerChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    return $ Ident (a:rest)

pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace

pSpace :: Parser ()
pSpace = Lexer.space 
           (void spaceChar) 
           (Lexer.skipLineComment "--") 
           (Lexer.skipBlockComment "{-" "-}")

This might be overwhelming, so let me explain some key points. I understand that I have a lot of different constructions that could match on an opening parenthesis, so I've wrapped those parsers in try, such that if they fail I can try the next parser that might consume an opening parenthesis. Perhaps I am using try too much? Does it affect performance to potentially backtrack so much?

I've also tried to make an expression parser by defining some terms and an operator table. You can see now that I've commented out the operator (function arrow), however. As the code looks right now I loop infinitely when I try to parse a function type. I think this might be due to the fact that when I try to parse a function type (invoked from pType) I immediately try to parse a type representing the domain of the function, which again call pType. How would I do this correctly?

If I decide to use the operator table instead, and not use my custom parser for function types, I parse things using wrong precedences. E.g Maybe a -> b gets parsed as Maybe (a -> b), while I would want it to be parsed as (Maybe a) -> b. Is there a way where I could use the operator table and still have type constructors bind more tightly than the function arrow?

Lastly, as I am learning megaparsec as I go, if anyone sees any misunderstandings or things that are wierd/unexpected, please tell me. I've read most of this tutorial to get my this far.

Please let me know of any edits I can make to increase the quality of my question!


Solution

  • Your code does not handle precedences at all, and also as a result of this it uses looping left-recursion.

    To give an example of left-recursion in your code, pFunctionType calls pType as the first action, which calls pFunctionType as the first action. This is clearly a loop.

    For precedences, I recommend to look at tutorials on "recursive descent operator parsing", a quick Google search reveals that there are several of them. Nevertheless I can summarize here the key points. I write some code.

    {-# language OverloadedStrings #-}
    
    import Control.Monad.Identity
    import Data.Text (Text)
    import Data.Void
    import Text.Megaparsec
    import Text.Megaparsec.Char
    import qualified Text.Megaparsec.Char.Lexer as Lexer
    
    type Parser a = ParsecT Void Text Identity a
    
    newtype Ident  = Ident String deriving Show
    newtype UIdent = UIdent String deriving Show
    
    data Type
        = TVar Ident
        | TFun Type Type       -- instead of "TLam"
        | TAdt UIdent [Type]
        | TTup [Type]
        | TUnit                -- instead of "TNil"
        | TBool
        | TInt
        | TFloat
        deriving Show
    
    pSymbol :: Text -> Parser Text
    pSymbol = Lexer.symbol pSpace
    
    pChar :: Char -> Parser ()
    pChar c = void (char c <* pSpace)
    
    pSpace :: Parser ()
    pSpace = Lexer.space
               (void spaceChar)
               (Lexer.skipLineComment "--")
               (Lexer.skipBlockComment "{-" "-}")
    
    keywords :: [String]
    keywords = ["Bool", "Int", "Float"]
    
    pUIdent :: Parser UIdent
    pUIdent = try $ do
        a <- upperChar
        rest <- many $ choice [letterChar, digitChar, char '_']
        pSpace
        let x = a:rest
        if elem x keywords
          then fail "expected an ADT name"
          else pure $ UIdent x
    
    pIdent :: Parser Ident
    pIdent = try $ do
        a <- lowerChar
        rest <- many $ choice [letterChar, digitChar, char '_']
        pSpace
        return $ Ident (a:rest)
    

    Let's stop here.

    • I changed the names of constructors in Type to conform to how they are called in Haskell. I also removed the parameter on Type, to have less noise in my example, but you can add it back of course.
    • Note the changed pUIdent and the addition of keywords. In general, if you want to parse identifiers, you have to disambiguate them from keywords. In this case, Int could parse both as Int and as an upper case identifier, so we have to specify that Int is not an identifier.

    Continuing:

    pClosed :: Parser Type
    pClosed =
          (TInt   <$  pSymbol "Int")
      <|> (TBool  <$  pSymbol "Bool")
      <|> (TFloat <$  pSymbol "Float")
      <|> (TVar   <$> pIdent)
      <|> (do pChar '('
              ts <- sepBy1 pFun (pChar ',') <* pChar ')'
              case ts of
                []  -> pure TUnit
                [t] -> pure t
                _   -> pure (TTup ts))
    
    pApp :: Parser Type
    pApp = (TAdt <$> pUIdent <*> many pClosed)
       <|> pClosed
    
    pFun :: Parser Type
    pFun = foldr1 TFun <$> sepBy1 pApp (pSymbol "->")
    
    pExpr :: Parser Type
    pExpr = pSpace *> pFun <* eof
    

    We have to group operators according to binding strength. For each strength, we need to have a separate parsing function which parses all operators of that strength. In this case we have pFun, pApp and pClosed in increasing order of binding strength. pExpr is just a wrapper which handles top-level expressions, and takes care of leading whitespace and matches the end of the input.

    When writing an operator parser, the first thing we should pin down is the group of closed expressions. Closed expressions are delimited by a keyword or symbol both on the left and the right. This is conceptually "infinite" binding strength, since text before and after such expressions don't change their parsing at all.

    Keywords and variables are clearly closed, since they consist of a single token. We also have three more closed cases: the unit type, tuples and parenthesized expressions. Since all of these start with a (, I factor this out. After that, we have one or more types separated by , and we have to branch on the number of parsed types.

    The rule in precedence parsing is that when parsing an operator expression of given strength, we always call the next stronger expression parser when reading the expressions between operator symbols.

    , is the weakest operator, so we call the function for the second weakest operator, pFun.

    pFun in turn calls pApp, which reads ADT applications, or falls back to pClosed. In pFun you can also see the handling of right associativity, as we use foldr1 TFun to combine expressions. In a left-associative infix operator, we would instead use foldl1.

    Note that parser functions always parse all stronger expressions as well. So pFun falls back on pApp when there is no -> (because sepBy1 accepts the case with no separators), and pApp falls back on pClosed when there's no ADT application.