Search code examples
haskellboilerplate

How can I remove all the boilerplate introduced by Trees That Grow?


I'm trying to define a programming language in Haskell. I wish to make the AST extensible: users of the AST module (for instance a pretty printer, an interpreter, a compiler, a type system, a language server and so on) should be able to extend it, by adding both new functionalities and new data (new datatypes to extend the syntax as well as new fields to the current data constructors to store data needed by the various components).

I tried to achieve this goal by using Trees That Grow (TTG). It works, but it results in way too much boilerplate. My minimal prototype becomes 10 times larger in terms of lines of code, and this number grows by the AST size times the number of extensions. Changing something minor requires changing several lines of the AST module, while changing something in the way extensibility is implemented would require rewriting most of it.

Is there any way to reduce the amount of boilerplate needed, or ideally remove it altogether?

Example with code of what I have so far

The "base" AST

This is just one small piece of the AST. It's something very similar to JSON, as I decided to start with a small prototype.

module AST ( KeyValue(..), Data(..) ) where

data KeyValue = KV String Data deriving (Show, Eq, Ord)

data Data =
    Null |
    Int Int |
    Num Double |
    Bool Bool |
    String String |
    Array [Data] |
    Object [KeyValue] deriving (Show, Eq, Ord)

The extensible AST, via Trees That Grow

In order to extend it via TTG, the datatypes become something like this:

data KeyValueX x =
    KVX (XKV x) String (DataX x) |
    KeyValueX (XKeyValue x)
 
data DataX x =
    NullX (XNull x) |
    IntX (XInt x) Int |
    NumX (XNum x) Double |
    BoolX (XBool x) Bool |
    StringX (XString x) String |
    ArrayX (XArray x) [DataX x] |
    ObjectX (XObject x) [KeyValueX x] |
    DataX (XData x)

Each of those types with a name starting in X is a type family:

type family XKV x
type family XKeyValue x
type family XNull x
type family XInt x
type family XNum x
type family XBool x
type family XString x
type family XArray x
type family XObject x
type family XData x

Further each of them requires to be listed in a type that makes it easier to derive classes:

type ForallX (c :: Type -> Constraint) x = (
    c (XKV x), c (XKeyValue x),
    c (XNull x), c (XInt x), c (XNum x), c (XBool x),
    c (XString x), c (XArray x), c (XObject x), c (XData x)
    )

-- now we can do:
deriving instance ForallX Show x => Show (KeyValueX x)
deriving instance ForallX Show x => Show (DataX x)
deriving instance ForallX Eq x => Eq (KeyValueX x)
deriving instance ForallX Eq x => Eq (DataX x)
deriving instance ForallX Ord x => Ord (KeyValueX x)
deriving instance ForallX Ord x => Ord (DataX x)

And of course everything requires to be exported:

module AST ( KeyValueX(..), DataX(..),
             XKV, XKeyValue,
             XNull, XNum, XBool, XString, XArray, XObject, XData,
             ForallX
           ) where

An extension to the AST

This is what is needed in order to create an extension. Even just the "identity" extension (UnDecorated) which needs to be provided.

For every instance you need to implement a typeclass for the type family of every type and data constructor:

data UD  -- UnDecorated, identity extension
 
type instance XKV UD = ()
type instance XKeyValue UD = Void
 
type instance XData UD = Void
type instance XNull UD = ()
type instance XInt UD = ()
type instance XNum UD = ()
type instance XBool UD = ()
type instance XString UD = ()
type instance XArray UD = ()
type instance XObject UD = ()

Then, in order to do things properly and ergonomic enough for the user, you need patterns and type aliases for every data constructor and data type:

type KeyValue = KeyValueX UD
pattern KV :: String -> Data -> KeyValue
pattern KV x y <- KVX _ x y where KV x y = KVX () x y
 
type Data = DataX UD
pattern Null :: Data
pattern Null <- NullX _ where Null = NullX ()
pattern DInt :: Int -> Data
pattern DInt x <- IntX _ x where DInt x = IntX () x
pattern DNum :: Double -> Data
pattern DNum x <- NumX _ x where DNum x = NumX () x
pattern DBool :: Bool -> Data
pattern DBool x <- BoolX _ x where DBool x = BoolX () x
pattern DString :: String -> Data
pattern DString x <- StringX _ x where DString x = StringX () x
pattern Array :: [Data] -> Data
pattern Array x <- ArrayX _ x where Array x = ArrayX () x
pattern Object :: [KeyValue] -> Data
pattern Object x <- ObjectX _ x where Object x = ObjectX () x

And of course all this stuff should be exported too:

module AST ( ...,
              UD,
              KeyValue, Data,
              pattern KV,
              pattern Null, pattern Num, pattern Bool,
              pattern String, pattern Array, pattern Object
           ) where

Summary

TTG turned my simple 10-line module, into a module of more than 100 lines where 90% of the code is boring, hard-to-maintain boilerplate:

  • The original (unextensible) AST module took around 10 lines.
  • The AST for the extensible version ended up taking about 50 lines and each of the data constructors (including their related type families) is mentioned around 4 times.
    • On top of that, every AST extension (including the required "identity" one) takes another 50 lines and mentions each of the data constructor another 3 times.

I would estimate that the whole language could take a couple dozen types with a total of more than a hundred of data constructors. Then I would need to define a handful of extensions to the AST. A non-extensible AST would take around 100 lines (as an order of magnitude), while one extended via TTG would take around 10,000. All the required boilerplate would make all of this unmanageable for me.

Question

Is there any way to reduce the amount of boilerplate needed, or ideally remove it altogether?

Otherwise are there any alternative ways to make my AST extensible without requiring this much work?


Solution

  • You can merge all of the type families into one indexed by a symbol:

    data KeyValueX x =
        KVX (X "KVX" x) String (DataX x) |
        KeyValueX (X "KeyValueX" x)
      deriving Generic
     
    data DataX x =
        NullX (X "NullX" x) |
        IntX (X "IntX" x) Int |
        NumX (X "NumX" x) Double |
        BoolX (X "BoolX" x) Bool |
        StringX (X "StringX" x) String |
        ArrayX (X "ArrayX" x) [DataX x] |
        ObjectX (X "ObjectX" x) [KeyValueX x] |
        DataX (X "DataX" x)
      deriving Generic
    
    --
    
    type family X (s :: k) (x :: l) :: Type
    

    Use generics to grab all of the constructor names:

    type ForAllX c x = (AllX c (CNames (DataX x)) x, AllX c (CNames (KeyValueX x)) x)
    
    deriving instance ForAllX Eq x => Eq (DataX x)
    deriving instance ForAllX Eq x => Eq (KeyValueX x)
    
    -- CNames defined using generics, below
    

    All of the boilerplate up to that point could also be generated from the "base AST" using Template Haskell.

    Having only one type family makes it easy to define extensions with catch-all clauses:

    data UD
    
    type instance X s UD = XUD s
    
    type family XUD (s :: Symbol) :: Type where
      XUD "KeyValueX" = Void
      XUD "DataX" = Void
      XUD _ = ()
    

    As for the patterns, maybe just exposing the constructors is not so bad? GHC does that.

    Imports and generics code to make this answer self-contained:

    {-# LANGUAGE
      DataKinds,
      DeriveGeneric,
      PolyKinds,
      StandaloneDeriving,
      TypeFamilies,
      UndecidableInstances #-}
    module T where
    
    import Data.Kind (Constraint, Type)
    import Data.Void
    import GHC.Generics
    import GHC.TypeLits
    
    type CNames a = GCNames (Rep a)
    
    type family GCNames (f :: Type -> Type) :: [Symbol] where
      GCNames (M1 D c f) = GCNames f
      GCNames (f :+: g) = GCNames f ++ GCNames g
      GCNames (M1 C (MetaCons name _ _) f) = '[name]
    
    type family (xs :: [k]) ++ (ys :: [k]) :: [k] where
      '[] ++ ys = ys
      (x ': xs) ++ ys = x ': (xs ++ ys)
    
    type family AllX (c :: Type -> Constraint) (xs :: [Symbol]) (x :: l) :: Constraint where
      AllX c '[] x = ()
      AllX c (s ': ss) x = (c (X s x), AllX c ss x)
    

    Gist: https://gist.github.com/Lysxia/3f6781b3a307a7e0c564920d6277bee2