Search code examples
performancehaskelllazy-evaluationalgebraic-data-types

Are sum types defined with UnboxedSums more efficient than plain enum?


For example, here is simple Haskell enum data type:

data Bool = False | True

There's -XUnboxedSums extension since GHC-8.2.1 that allows to define sum types in more memory-efficient way. Here is a quote from documentation:

In the degenerate case where all the alternatives have zero width, such as the Bool-like (# (# #) | (# #) #), the unboxed sum layout only has an Int32 tag field (i.e., the whole thing is represented by an integer).

I wonder, whether this representation is more efficient than using simple plain Haskell enum data type?

And here is the full documentation:


Solution

  • Semantics and Representation

    The two types cannot be used interchangeable.

    The type

    data Bool = False | True
    

    is a lifted type. This means that a variable x :: Bool can still be unevaluated (e.g. a thunk). It will therefore be represented as a pointer.

    In contrast

    (# (# #) | (# #) #)
    

    is unlifted, and really can only have the two values, and will be represented as just a machine integer.

    Space efficiency

    This does, however not mean that the latter is more space efficient: Because True and False are nullary constructors (they take no arguments), they exist once and for all in the static code of your program, and the pointers just point to them. So the cost is one machine word in all cases.

    Run time efficiency

    The unboxed variant may be slightly more efficient than Bool:

    • For Bool, the code first has to check is this already evaluated?, and then it can branch on which constructor it is.

      The branching is pretty efficient: The code does not have to follow the pointer for that: The “pointer tag” in the low few bits of the pointer address will indicate which constructor it is. There is a slight cost for masking the other bits, though.

    • In the unboxed variant, well, it's just 0 or 1, and no thunk check is needed.

    Things that might be true one day

    jberryman says: as of Jan 2021 the following don't appear to be implemented yet, but this is how things might one day work.

    Unboxing in data types

    If you define a data type like this

    data Foo = Foo {-# UNPACK #-} !Bool
    

    you should get exactly the same result as if you’d write

    data Foo = Foo (# (# #) | (# #) #)
    

    as that was the main purpose of the unboxed sum extension.

    Unboxing in functions

    If you define a function that is strict in a Bool argument, e.g.

    foo True = "Hello"
    foo False = "World!"
    

    then I could imagine (but did not check) that the compiler optimizes that into a worker that takes a (# (# #) | (# #) #) and a wrapper that takes care of the thunk-check. The wrapper might then inline in use-sites where foo is used, and everything might end up in terms of (# (# #) | (# #) #).

    Conclusion

    Don’t bother.