Search code examples
haskellpolymorphism

Restricting a type variable to a class type in a data declaration in Haskell


I want to declare a data type that is constructed from a list of comparable elements. This is because I wrote a function that only works correctly if its input lists are sorted, and I wanted a way to get the compiler to stop its being accidentally used on unsorted lists. (I don't think there's any way to stop a user lying, but I at least wanted them to declare the lists as sorted).

It doesn't make sense (to me) to have a sorted list unless the elements of the list are in Ord, and I wanted to try to get the compiler to enforce at least this much.

data WrappedList a = WrappedList [a]

data SortedList (Ord a) => a = SortedList [a]

SortedList is one of my failed attempts, whilst WrappedList does actually compile. I can't find any examples of this, so maybe I've missed the point completely?

(N.B. I've just starting learning Haskell and this is a toy problem.)


Solution

  • Assuming you could create your SortedList type, the function you want to write will most likely have a type that looks a bit like:

    myFunc :: Ord a => SortedList a -> ...
    

    If myFunc doesn't have the Ord constraint then it won't be able to use comparison functions (which would mean it's completely unable to observe whether the list is sorted or not; that doesn't make the fact that it's sorted completely useless, but certainly you would lose a large part of the utility of knowing it was already sorted).

    That means that if someone wants to call myFunc with a type that doesn't have an Ord constraint, the compiler is going to call them on it anyway. So stopping them from constructing the SortedList a if they can't prove Ord a doesn't actually catch any additional errors. So simply adding an Ord a constraint to SortedList itself is probably not something you should put a lot of effort into; it actually buys you very little. A simple type like this:

    data SortedList a = SortedList [a]
    
    -- Actually it could be a newtype, which is a bit more efficient; but don't worry
    -- about it if you don't know what that is yet
    newtype SortedList a = SortedList [a]
    

    works fine to create the situation where it's hard to accidentally call your function on an unsorted list (the caller would have to deliberately lie to you, or at least be negligent about assuming a list from elsewhere was sorted), and the Ord constraints on functions that do anything interesting with the order in a SortedList will catch anyone using your functions with lists that actually can't be sorted (by a canonical order) because their element type doesn't even have such an order.

    Haskell in fact used to have a feature for doing exactly what you're after, but after the experience of having it the expert and community consensus was that it is better not to have the feature at all. It has been deprecated, is no longer available by default in newer GHC versions, and will be removed entirely at some point. So since you are learning Haskell, I recommend you simply never learn how to use this feature. Learning to solve problems without it is a useful skill that will carry over into future Haskell; learning to rely on it is a dead end.


    If you really want the Ord a check to be made at the point where the SortedList wrapper is constructed the way to do that in more modern Haskell is to use the GADTs language extension. This is however a more advanced Haskell feature, so probably not something you should try to tackle when you're still learning the ropes.

    But for completeness, it would allow you to write a type like this:

    data SortedList a where
      SortedList :: Ord a => [a] -> SortedList a
    

    This means that when the SortedList constructor is applied the compiler will not just check Ord a, it will actually store the Ord instance inside the SortedList value. Functions that use a SortedList then don't actually need an Ord constraint, because they get access to the stored instance when pattern matching on the SortedList.

    Be careful about using this technique everywhere though; it can cause significant performance issues. If you use many values with a stored instance you'll potentially use a lot of memory storing references to the same instance (and time dereferencing all those references). But even worse this technique defeats a lot of optimisations the compiler can usually make. The compiler can often inline and specialise functions with typeclass constraints so that they end up calling statically-known typeclass methods, which can be much faster than calling them through a runtime dictionary. When you're managing the dictionary passing (by having them stored in GADTs) instead of the compiler, it can become much harder for the compiler to make these optimisations.

    If you dig further into GADTs you'll also find that they let you "hide" type parameters, and that opens a large can of worms that you need a very firm grasp on Haskell's type system and syntax to keep hold of. Or at least the error messages you're likely to get when something goes wrong are very confusing to a novice, which makes them hard to correct.

    GADTs are a really powerful feature that enables ways of structuring your code that simply couldn't be done with vanilla Haskell data types. My personal rubrik is to save them for cases like that, and not use them merely to catch an error a bit earlier that the compiler would have caught anyway. That's a cost-benefit tradeoff you'll have to make for yourself. But certainly GADTs are well worth learning and mastering at some point in your Haskell journey!


    If you want to go further and actually have a data type that the compiler enforces to be a sorted list, then there in fact are ways to do that. The most straightforward way doesn't even need GADTs! The trick is to use the module system. (If you're not already comfortable with writing multi-module programs, then probably save this for later; your instinct to make the compiler enforce your invariants is very good Haskell, but you won't be able to do everything that Haskell is capable of while you're just getting started)

    In a separate module from your other code (i.e. in a file SortedList.hs), write something like this:

    module SortedList
      ( SortedList   -- note this just exports the *type* name, not the constructor
      , fromUnsorted
      , empty
      , elements
      , sortedInsert
      , unsafeFromSorted
      )
    where
    
    import Data.List (sort, insert)
    
    newtype SortedList a = SL [a]
    
    fromUnsorted :: Ord a => [a] -> SortedList a
    fromUnsorted = SL . sort
    
    empty :: SortedList a
    empty = SL []
    
    elements :: SortedList a -> [a]
    elements (SL xs) = xs
    
    sortedInsert :: Ord a => a -> SortedList a -> SortedList a
    sortedInsert x (SL xs) = SL $ insert x xs
    
    -- Optionally include a function like this to allow a programmer to declare
    -- that a list is *already* sorted. Having an option like this is equivalent
    -- to exporting the constructor, so it loses the guarantee that the list is
    -- *always* sorted. But there are often ways to build a list that is sorted by
    -- construction (e.g. [1..100]), so having to call `sort` on them is a
    -- performance hit. Take your pick between performance and safety.
    unsafeFromSorted :: [a] -> SortedList a
    unsafeFromSorted = SL
    

    The key thing is that we haven't exported the constructor SL (which I've named differently from the SortedList only to avoid confusion when I'm talking about them). Nobody outside this module can apply SL to a random unsorted list. They would have to use the fromUnsorted function which is going to sort the list. Alternatively if they already have a SortedList they can use sortedInsert to add a new element, but that maintains the property that the output list is sorted if the input was. (Or they can use empty, since an empty list is obviously always sorted)

    If those are the only ways that other parts of the program can get access to a SortedList, then you always know that such a list is sorted. The elements function extracts the elements as a raw list (needed since the constructor isn't available so nobody can pattern match to get them); this throws away guarantee but allows full access to ordinary list functions.

    The sortedInsert function isn't strictly necessary; you could always use fromUnsorted . insert x . elements to do the same thing. But that is much less efficient, since it requires re-checking that the list is sorted to convert it back to a SortedList at the end, and we know that insert is always going to produce a sorted result if the input is sorted. So including an analogue of insert makes it easier and more efficient to work with SortedLists. Similarly you could think of other list operations that can be done (more efficiently) in a way that maintains the sorted invariant and include those. The more of this you do, the less need there is for something like unsafeFromSorted, as people will be able to work with SortedLists directly instead of working with ordinary lists and then converting them at the end.