Search code examples
haskellpattern-matchingtypecheckingapplicativegadt

In Haskell, how to parse an untyped AST to a typed one based on a GADT?


I'm writing a domain-specific language in Haskell, and have settled on a design with two ASTs: an initial untyped one that represents syntax and a final typed one that represents everything. I'm writing the final one as a GADT to get better typechecking.

I think it's almost working, but I'm having trouble writing a function that converts initial -> final (checks the types, plus some other things not shown, like all references correspond to a variable).

Here's a simplified example:

{-# LANGUAGE GADTs, StandaloneDeriving #-}

module Main where

-- untyped initial AST

data Untyped
  = UNum Int
  | UStr String
  | UAdd Untyped Untyped
  deriving (Show, Eq)

-- typed final AST

data Typed a where
  TNum :: Int    -> Typed Int
  TStr :: String -> Typed String
  TAdd :: Typed Int -> Typed Int -> Typed Int

deriving instance Eq   (Typed a)
deriving instance Show (Typed a)

-- wrapper that allows working with a `Typed a` for any `a`
data TypedExpr where
  TypedExpr :: Typed a -> TypedExpr

And this is my attempt at the check function. The base cases are simple:

check :: Untyped -> Either String TypedExpr
check (UNum n) = Right $ TypedExpr $ TNum n
check (UStr s) = Right $ TypedExpr $ TStr s
-- check (Uadd e1 e2) = ???

But how do I add Add? It can recursively evaluate the sub-expressions to values of type Either String (TypedExpr (Typed a)), but I haven't managed to unwrap those, check that the types line up (both as should be Ints), and re-wrap afterward. I planned to do it all with big pattern matches, but GHC doesn't approve:

My brain just exploded
  I can't handle pattern bindings for existential or GADT data constructors.
  Instead, use a case-expression, or do-notation, to unpack the constructor.

That's explained here, but I didn't understand the explanation. It seems I don't want pattern matching.

Update: I must have messed something else up without noticing. Pattern matching works, as shown by Nikita.

So I fiddled around trying to force things into the right shape, but haven't gotten anything substantial yet. If these were just Either String SomeValue I would want to use applicatives, right? Is it possible to add another level of unwrapping + type checking to them? I suspect this answer is close to what I want since the question is very similar, but again I don't understand it. This may also be related.

Update: That first answer is just what I wanted after all. But I couldn't see how until chi wrote the intermediate version below without the extra Type type. Here's a working solution. The trick was to tag TypedExprs with a new type representing just the return type (a) of a Typed a:

data Returns a where
  RNum :: Returns Int
  RStr :: Returns String

-- extend TypedExpr to include the return type
data TypedExpr2 where
  TypedExpr2 :: Returns a -> Typed a -> TypedExpr2

That way check doesn't have to know whether each subexpression is a raw TNum or a function (like Add) that returns a TNum:

check :: Untyped -> Either String TypedExpr2
check (UNum n) = Right $ TypedExpr2 RNum (TNum n)
check (UStr s) = Right $ TypedExpr2 RStr (TStr s)
check (UAdd u1 u2) = do
  -- typecheck subexpressions, then unwrap by pattern matching
  TypedExpr2 r1 t1 <- check u1
  TypedExpr2 r2 t2 <- check u2
  -- check the tags to find out their return types
  case (r1, r2) of
    -- if correct, create an overall expression tagged with its return type
    (RNum, RNum) -> return $ TypedExpr2 RNum $ TAdd t1 t2
    _ -> Left "type error"

GHC is smart enough to know that the two as in any TypedExpr2 must match, so it catches you if you attempt to use the wrong overall return type at the end. Wonderful!


Solution

  • Technically it is doable, but it's quite inconvenient: you have to "dig" until the GADT constructor is found.

    check :: Untyped -> Either String TypedExpr
    check (UNum n) = return $ TypedExpr $ TNum n
    check (UStr s) = return $ TypedExpr $ TStr s
    check (UAdd t1 t2) = do
        t1 <- check t1
        t2 <- check t2
        case (t1, t2) of
          (TypedExpr (TNum x)     , TypedExpr (TNum y))
             -> return $ TypedExpr $ TAdd (TNum x    ) (TNum y)
          (TypedExpr (TAdd x1 x2) , TypedExpr (TNum y))
             -> return $ TypedExpr $ TAdd (TAdd x1 x2) (TNum y)
          (TypedExpr (TNum x)     , TypedExpr (TAdd y1 y2))
             -> return $ TypedExpr $ TAdd (TNum x    ) (TAdd y1 y2)
          (TypedExpr (TAdd x1 x2) , TypedExpr (TAdd y1 y2))
             -> return $ TypedExpr $ TAdd (TAdd x1 x2) (TAdd y1 y2)
          _ -> Left "type error"
    

    I would look for alternatives. The above approach suffers from combinatorial explosion when the number of constructors is large.