Tags: haskell, garbage-collection, ghc, language-design

Is there inherent "cost of carry" of garbage thunks in Haskell?


I often see a high number of cycles spent in GC when running GHC-compiled programs.

These numbers tend to be an order of magnitude higher than my JVM experience suggests they should be. In particular, the number of bytes "copied" by GC seems to be vastly larger than the amount of data I'm computing.

Is such a difference between non-strict and strict languages fundamental?


Solution

  • tl;dr: Most of the stuff that the JVM does in stack frames, GHC does on the heap. If you wanted to compare GHC heap/GC stats with the JVM equivalent, you'd really need to account for some portion of the bytes/cycles the JVM spends pushing arguments on the stack or copying return values between stack frames.

    Long version:

    Languages targeting the JVM typically make use of its call stack. Each invoked method has an active stack frame that includes storage for the parameters passed to it, additional local variables, and temporary results, plus room for an "operand stack" used for passing arguments to and receiving results from other methods it calls.

    As a simple example, if the Haskell code:

    bar :: Int -> Int -> Int
    bar a b = a * b
    foo :: Int -> Int -> Int -> Int
    foo x y z = let u = bar y z in x + u
    

    were compiled for the JVM, the byte code would probably look something like:

    public static int bar(int, int);
      Code:
        stack=2, locals=2, args_size=2
           0: iload_0   // push a
           1: iload_1   // push b
           2: imul      // multiply and push result
           3: ireturn   // pop result and return it
    
    public static int foo(int, int, int);
      Code:
        stack=2, locals=4, args_size=3
           0: iload_1   // push y
           1: iload_2   // push z
           2: invokestatic bar   // call bar, pushing result
           5: istore_3  // pop and save to "u"
           6: iload_0   // push x
           7: iload_3   // push u
           8: iadd      // add and push result
           9: ireturn   // pop result and return it
    

    Note that calls to built-in primitives like imul and user-defined methods like bar involve copying/pushing the parameter values from local storage to the operand stack (using iload instructions) and then invoking the primitive or method. Return values then need to be saved/popped to local storage (with istore) or returned to the caller with ireturn; occasionally, a return value can be left on the stack to serve as an operand for another method invocation. Also, while it's not explicit in the byte code, the ireturn instruction involves a copy, from the callee's operand stack to the caller's operand stack. Of course, in actual JVM implementations, various optimizations are presumably possible to reduce copying.

    When something else eventually calls foo to produce a computation, for example:

    some_caller t = foo (1+3) (2+4) t + 1
    

    the (unoptimized) code might look like:

           iconst_1
           iconst_3
           iadd      // put 1+3 on the stack
           iconst_2
           iconst_4
           iadd      // put 2+4 on the stack
           iload_0   // put t on the stack
           invokestatic foo
           iconst_1
           iadd
           ireturn
    

    Again, subexpressions are evaluated with a lot of pushing and popping on the operand stack. Eventually, foo is invoked with its arguments pushed on the stack and its result popped off for further processing.

    All of this allocation and copying takes place on the stack, so there's no heap allocation involved in this example.

    Now, what happens if that same code is compiled with GHC 8.6.4 (without optimization and on an x86_64 architecture for the sake of concreteness)? Well, the pseudocode for the generated assembly is something like:

    foo [x, y, z] =
        u = new THUNK(sat_u)                   // thunk, 32 bytes on heap
        jump: (+) x u
    
    sat_u [] =                                 // saturated closure for "bar y z"
        push UPDATE(sat_u)                     // update frame, 16 bytes on stack
        jump: bar y z
    
    bar [a, b] =
        jump: (*) a b
    

    The calls/jumps to the (+) and (*) "primitives" are actually more complicated than I've made them out to be because of the typeclass that's involved. For example, the jump to (+) looks more like:

        push CONTINUATION(\f -> f x u)         // continuation, 24 bytes on stack
        jump: (+) dNumInt                      // get the right (+) from typeclass instance
    

    If you turn on -O2, GHC optimizes away this more complicated call, but it also optimizes away everything else that's interesting about this example, so for the sake of argument, let's pretend the pseudocode above is accurate.
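
    For intuition, the dictionary passing behind that call can be pictured in plain Haskell. The following is a hypothetical hand-desugaring (the record `NumDict` and the field names are made up for illustration; GHC's actual dictionary representation differs, though `dNumInt` echoes the name in the pseudocode above):

    ```haskell
    -- A made-up record standing in for the Num typeclass dictionary.
    data NumDict a = NumDict
      { plusD  :: a -> a -> a
      , timesD :: a -> a -> a
      }

    -- The Int instance dictionary that the pseudocode calls dNumInt.
    dNumInt :: NumDict Int
    dNumInt = NumDict { plusD = (+), timesD = (*) }

    -- "bar" and "foo" with the dictionary made explicit: every use of
    -- (+) or (*) must first select the right field from the dictionary,
    -- which is the extra indirection the continuation/jump above encodes.
    barD :: NumDict a -> a -> a -> a
    barD d a b = timesD d a b

    fooD :: NumDict a -> a -> a -> a -> a
    fooD d x y z = let u = barD d y z in plusD d x u
    ```

    So `fooD dNumInt 1 2 3` computes `1 + (2 * 3)`, i.e. `7`, with the dictionary threaded through explicitly.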

    Again, foo isn't of much use until someone calls it. For the some_caller example above, the portion of code that calls foo will look something like:

    some_caller [t] =
        ...
        foocall = new THUNK(sat_foocall)       // thunk, 24 bytes on heap
        ...
    
    sat_foocall [] =                           // saturated closure for "foo (1+3) (2+4) t"
        ...
        v = new THUNK(sat_v)                   // thunk "1+3", 16 bytes on heap
        w = new THUNK(sat_w)                   // thunk "2+4", 16 bytes on heap
        push UPDATE(sat_foocall)               // update frame, 16 bytes on stack
        jump: foo sat_v sat_w t
    
    sat_v [] = ...
    sat_w [] = ...
    

    Note that nearly all of this allocation and copying takes place on the heap, rather than the stack.

    Now, let's compare these two approaches. At first blush, it looks like the culprit really is lazy evaluation. We're creating these thunks all over the place that wouldn't be necessary if evaluation were strict, right? But let's look at one of these thunks more carefully. Consider the thunk for sat_u in the definition of foo. It's 32 bytes / 4 words with the following contents:

    // THUNK(sat_u)
    word 0:  ptr to sat_u info table/code
         1:  space for return value
         // variables we closed over:
         2:  ptr to "y"
         3:  ptr to "z"
    

    The creation of this thunk isn't fundamentally different than the JVM code:

           0: iload_1   // push y
           1: iload_2   // push z
           2: invokestatic bar   // call bar, pushing result
           5: istore_3  // pop and save to "u"
    

    Instead of pushing y and z onto the operand stack, we loaded them into a heap-allocated thunk. Instead of popping the result off the operand stack into our stack frame's local storage and managing stack frames and return addresses, we left space for the result in the thunk and pushed a 16-byte update frame onto the stack before transferring control to bar.

    Similarly, in the call to foo in some_caller, instead of evaluating the argument subexpressions by pushing constants on the stack and invoking primitives to push results on the stack, we created thunks on the heap, each of which included a pointer to info table / code for invoking primitives on those arguments and space for the return value; an update frame replaced the stack bookkeeping and result copying implicit in the JVM version.

    Ultimately, thunks and update frames are GHC's replacement for stack-based parameter and result passing, local variables, and temporary workspace. A lot of activity that takes place in JVM stack frames takes place in the GHC heap.

    Now, most of the stuff in JVM stack frames and on the GHC heap quickly becomes garbage. The main difference is that in the JVM, stack frames are automatically tossed out when a function returns, after the runtime has copied the important stuff out (e.g., return values). In GHC, the heap needs to be garbage collected. As others have noted, the GHC runtime is built around the idea that the vast majority of heap objects will immediately become garbage: a fast bump allocator is used for initial heap object allocation, and instead of copying out the important stuff every time a function returns (as for the JVM), the garbage collector copies it out when the bump heap gets kind of full.
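
    Incidentally, the statistics being discussed here are easy to see for yourself. Compiling with `-rtsopts` and running with the `-s` RTS flag (both standard GHC options) prints the allocation and GC summary on exit, assuming your program is in a file like `Main.hs`:

    ```shell
    # Compile with RTS options enabled.
    ghc -O2 -rtsopts Main.hs -o main

    # The "-s" RTS flag prints a summary on exit, including
    # "bytes allocated in the heap" and "bytes copied during GC" --
    # the numbers this answer is about.
    ./main +RTS -s
    ```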

    Obviously, the above toy example is ridiculous. In particular, things are going to get much more complicated when we start talking about code that operates on Java objects and Haskell ADTs, rather than Ints. However, it serves to illustrate the point that a direct comparison of heap usage and GC cycles between GHC and JVM doesn't make a whole lot of sense. Certainly, an exact accounting doesn't really seem possible as the JVM and GHC approaches are too fundamentally different, and the proof would be in real-world performance. At the very least, an apples-to-apples comparison of GHC heap usage and GC stats needs to account for some portion of the cycles the JVM spends pushing, popping, and copying values between operand stacks. In particular, at least some fraction of JVM return instructions should count towards GHC's "bytes copied".

    As for the contribution of "laziness" to heap usage (and heap "garbage" in particular), it seems hard to isolate. Thunks really play a dual role as a replacement for stack-based operand passing and as a mechanism for deferred evaluation. Certainly a switch from laziness to strictness can reduce garbage -- instead of first creating a thunk and then eventually evaluating it to another closure (e.g., a constructor), you can just create the evaluated closure directly -- but that just means that instead of your simple program allocating a mind-blowing 172 gigabytes on the heap, maybe the strict version "only" allocates a modest 84 gigabytes.
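
    To make that concrete, here's a small illustration of the lazy-to-strict switch (run each version under `+RTS -s` to compare the allocation totals):

    ```haskell
    import Data.List (foldl')

    -- Lazy left fold: the accumulator is a growing chain of suspended
    -- "acc + x" thunks, all heap-allocated and forced only at the end.
    lazySum :: [Int] -> Int
    lazySum = foldl (+) 0

    -- Strict left fold: the accumulator is forced at every step, so each
    -- intermediate sum is an evaluated Int rather than a fresh thunk.
    strictSum :: [Int] -> Int
    strictSum = foldl' (+) 0

    main :: IO ()
    main = do
      let xs = [1 .. 1000000] :: [Int]
      print (lazySum xs)    -- same result...
      print (strictSum xs)  -- ...but far fewer heap-allocated thunks
    ```

    Both compute the same answer; the strict version simply skips the create-a-thunk-then-evaluate-it round trip on every step.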

    As far as I can see, the specific contribution of lazy evaluation to "bytes copied" should be minimal -- if a closure is important at GC time, it will need to be copied. If it's still an unevaluated thunk, the thunk will be copied. If it's been evaluated, just the final closure will need to be copied. If anything, since thunks for complicated structures are much smaller than their evaluated versions, laziness should typically reduce bytes copied. Instead, the usual big win with strictness is that it allows certain heap objects (or stack objects) to become garbage faster so we don't end up with space leaks.
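
    A quick illustration of that size asymmetry (a sketch; the word counts assume a typical 64-bit GHC heap layout):

    ```haskell
    -- Unevaluated, "smallThunk" is a single small closure -- a few machine
    -- words -- no matter how large the list would be once forced. A GC
    -- running at that point copies only the tiny thunk.
    smallThunk :: String
    smallThunk = replicate 1000000 'x'

    main :: IO ()
    main = do
      -- Forcing the spine materializes ~1,000,000 cons cells (roughly
      -- three words each) -- vastly more for a copying collector to move
      -- if the list is still live at collection time.
      print (length smallThunk)
    ```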