java, garbage-collection, java-native-interface

Efficient GC-assisted cleanup of LARGE native resources


I'm currently attempting to write a tensor-processing/deep learning library in Java, similar to PyTorch or TensorFlow.

Tensors reference MemoryHandles, which hold the native memory needed for the tensor data. During training, tensor instances are created rapidly, but nevertheless the JVM heap itself stays at about 100-200 MB, so the garbage collector is never prompted to collect. As a result, the memory footprint of the application explodes and consumes upwards of 16 GB of RAM, because of how much native memory is needed to store the tensor data.

The memory handles themselves are allocated via a central MemoryManager, which creates PhantomReferences to the handed-out handles; after a handle object is garbage collected, the associated native memory is correctly freed.
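
For context, here is a minimal sketch of what such a PhantomReference-based manager can look like. The class and method names are illustrative (not the actual API of my library), and the native calls are stubbed out:

    import java.lang.ref.PhantomReference;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: names are illustrative, not the actual library API.
    final class MemoryHandle {
        final long address;  // address of the native allocation backing the tensor data
        final long nBytes;
        MemoryHandle(long address, long nBytes) { this.address = address; this.nBytes = nBytes; }
    }

    final class MemoryManager {
        private final ReferenceQueue<MemoryHandle> queue = new ReferenceQueue<>();
        // Keep the PhantomReferences strongly reachable, mapped to the address they guard.
        private final Map<Reference<MemoryHandle>, Long> pending = new ConcurrentHashMap<>();

        MemoryHandle allocate(long nBytes) {
            long address = nativeAlloc(nBytes); // malloc via JNI/Panama in the real thing
            MemoryHandle handle = new MemoryHandle(address, nBytes);
            pending.put(new PhantomReference<>(handle, queue), address);
            return handle;
        }

        // Poll the queue (e.g. from a background thread) and free the native memory
        // belonging to handles that have been garbage collected.
        void processCollectedHandles() {
            Reference<? extends MemoryHandle> ref;
            while ((ref = queue.poll()) != null) {
                Long address = pending.remove(ref);
                if (address != null) {
                    nativeFree(address);
                }
            }
        }

        private static long nativeAlloc(long nBytes) { return 0; } // placeholder for the JNI call
        private static void nativeFree(long address) { }           // placeholder for the JNI call
    }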

What makes this problem hard

Why is the GC not smart enough to instantly clean these tensors?

Operations such as .matmul(), .plus() etc. are not immediately executed, but rather recorded into a Graph, where nodes represent either variables or operations. This graph is necessary for backpropagation, so creating it is not optional. This creates a rather complicated reference structure that is hard for a GC to unravel.
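
To illustrate, the reference chain looks roughly like this (a simplified, hypothetical sketch, not my actual graph classes): every recorded node keeps its output tensor, and therefore its MemoryHandle and native memory, strongly reachable until the node is dropped from the graph.

    import java.util.List;

    // Simplified sketch: Graph -> Node -> Tensor -> MemoryHandle -> native memory
    final class Node {
        final String op;          // e.g. "matmul", "plus", or "variable"
        final List<Node> inputs;  // edges needed for backpropagation
        final Tensor output;      // keeps the tensor (and its native memory) alive
        Node(String op, List<Node> inputs, Tensor output) {
            this.op = op;
            this.inputs = inputs;
            this.output = output;
        }
    }

    final class Tensor {
        // the MemoryHandle from the sketch above: an 8-byte wrapper guarding megabytes of native data
        final MemoryHandle handle;
        Tensor(MemoryHandle handle) { this.handle = handle; }
    }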

Attempted solutions

I have attempted various less than ideal ways to fix this problem:

Insanely small JVM heap size

-Xmx100M

By forcing the garbage collector to work with an insanely small heap, it keeps the native memory footprint bearable. This introduces very little slowdown to the training loop in the cases I have evaluated and would be bearable, if finding the exact heap size (in MB) that makes the GC do what you want weren't so painful. Also, if the memory usage of your application isn't more or less constant, this approach bursts into flames.

Periodic full GC

Running a full GC for every X MB of natively allocated memory. This introduces abysmal slowdown to the training loop in the cases I have evaluated. It is the only "in-application" fix that I can think of, meaning the user is not forced to pass weird JVM args when running their program. While -XX:+UseZGC and -XX:+ExplicitGCInvokesConcurrent show some improvement, the situation remains rather bad.
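
For clarity, this is roughly what that looks like in code; the counter, threshold and method names are placeholders I made up for illustration:

    // Hypothetical sketch of the "periodic full GC" approach.
    private static final long FULL_GC_THRESHOLD_BYTES = 100_000_000L; // "X MB", tuned per workload
    private final java.util.concurrent.atomic.AtomicLong nativeBytesSinceLastGC =
            new java.util.concurrent.atomic.AtomicLong();

    void onNativeAllocation(long nBytes) {
        if (nativeBytesSinceLastGC.addAndGet(nBytes) >= FULL_GC_THRESHOLD_BYTES) {
            nativeBytesSinceLastGC.set(0);
            System.gc(); // blocks the allocating (training) thread, hence the slowdown
        }
    }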

Both of these solutions do in fact keep the memory footprint of the application at bay, which goes to show that IF the GC catches all the unreferenced MemoryHandles, everything is freed correctly.

Thus my question:

When JVM applications experience high allocation rates, the GC usually kicks in hard. The problem here is that we effectively do have high allocation rates, but they are not at all reflected in the JVM heap. If you put yourself in the shoes of the garbage collector, the last place you would expect to place your efforts is freeing a Java object consisting solely of an 8-byte long. If, however, it were possible to hint to the GC that it should try harder to free objects of the MemoryHandle type, I suspect these problems would largely disappear. So my question is: is this possible? I wouldn't mind writing hacky native code if necessary.

Another idea would be to use some JVM argument to make the full GC less aggressive, more in line with the slight slowdown that I experienced with -Xmx100m.

If this is in fact not possible, are there alternative solutions to solving this problem? Surely I can't be the first person to attempt to write a Java library with large native resources.


Solution

  • I think that I have now figured out a solution that works as well as it can.

    The problem

    If you face a similar issue, you probably have code that fits some of these criteria:

    1. A high allocation rate of small objects, which hold large native resources
    2. Objects referencing each other in complicated ways that are hard for the GC to untangle
    3. No place in the code where you can safely determine that the resources are no longer in use

    Requirements for a potential solution

    Your requirements probably are:

    1. Don't bottleneck the loop that allocates the native handles
    2. Nearly instantaneous cleanup after the native handle becomes unreferenced

    The tradeoff

    It turns out you cannot accomplish both of these requirements at once; unfortunately, you have to choose one or the other. If you don't want to bottleneck the loop that allocates these native handles at a high rate, you have to trade RAM for that. If you want instantaneous cleanup after a native handle becomes unreferenced, you have to sacrifice the execution speed of the code that allocates the handles.

    The (hacky) solution

    Create a mechanism such that you can asynchronously request a full GC to be performed.

        // requires: import java.util.concurrent.atomic.AtomicBoolean;

        // Set to true whenever enough native memory has become unreachable to make
        // a full GC worthwhile; polled by the background thread below.
        private final AtomicBoolean shouldRunGC = new AtomicBoolean(false);

        private final Thread gcThread = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                if (shouldRunGC.getAndSet(false)) {
                    System.gc();
                }
            }
        }, "GC-Invoker-Thread");

        // instance initializer: start the invoker as a daemon thread on construction
        {
            gcThread.setDaemon(true);
            gcThread.start();
        }
    

    Ideally, you have a region of code that is loosely associated with the cleanup of these handle objects. It doesn't have to mean that the objects can be safely disposed of at this point in time; it just has to mean that the object is >probably< safe to delete. This callsite merely serves as a statistical metric to determine the best interval at which to trigger garbage collection. You should also know the size of your native resource, or at least have an estimate of how bad it would be to keep a given object around.

    Alternatively, you could place this at the point where your native handles are allocated, but note that the statistical metric you collect there will be less effective (see the sketch after the Sci-Core example below).

    This is an example of such a method in my tensor processing library Sci-Core:

        /**
         * Drops the history of how this tensor was computed.
         * This is useful e.g. when the tensor was changed by the optimizer,
         * where backpropagating into the previous training step would make no sense.
         * Thus, we no longer need to keep a record of how the tensor was computed.
         * Executes all operations needed to compute the value of the specified tensor
         * contained in the graph, if it is not already computed.
         * @param tensor the tensor to drop the computation history for
         */
        public void dropHistory(ITensor tensor) {
            // for all nodes now dropped from the graph
            ...
                nBytesDeletedSinceLastAsyncGC += value.getNumBytes();
                nBytesDeletedSinceLastOnSameThreadGC += value.getNumBytes();
            ...

            if (nBytesDeletedSinceLastAsyncGC > 100_000_000) { // 100 MB
                shouldRunGC.set(true);
                nBytesDeletedSinceLastAsyncGC = 0;
            }
            if (nBytesDeletedSinceLastOnSameThreadGC > 2_000_000_000) { // 2 GB
                System.gc();
                nBytesDeletedSinceLastOnSameThreadGC = 0;
            }
        }
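
    If you instead go with the alternative of triggering at allocation time (mentioned above), the sketch below shows roughly what that could look like. Note that doNativeAlloc and the counter name are placeholders, not actual Sci-Core API:

        // Variant: count bytes at allocation time instead of in dropHistory().
        private long nBytesAllocatedSinceLastAsyncGC = 0;

        public MemoryHandle alloc(long nBytes) {
            MemoryHandle handle = doNativeAlloc(nBytes); // however your manager actually allocates
            nBytesAllocatedSinceLastAsyncGC += nBytes;
            if (nBytesAllocatedSinceLastAsyncGC > 100_000_000) { // 100 MB
                shouldRunGC.set(true); // picked up by the GC-Invoker-Thread shown above
                nBytesAllocatedSinceLastAsyncGC = 0;
            }
            return handle;
        }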
    
    

    To fight against bottlenecking your allocation loop, you can use the following JVM arguments:

    -XX:+UseZGC -XX:+ExplicitGCInvokesConcurrent -XX:MaxGCPauseMillis=1
    

    Why would this work?

    Triggering garbage collection regularly seems to make the garbage collector interested in cleaning up the very small handle objects (among basically every other object that you create in your application). You still don't get "prioritization" for your handles; they just happen to also be garbage collected. If your application, in addition to the native handle objects, also allocates a significant amount of other small objects, the effectiveness of this technique will be significantly reduced.

    Note, however, that triggering the garbage collector is expensive, and thus the thresholds for nBytesDeletedSinceLastAsyncGC and nBytesDeletedSinceLastOnSameThreadGC must be chosen carefully. Running the garbage collector asynchronously is less expensive, as it will not bottleneck your allocation loop very much, but it is also less effective than calling the garbage collector on the same thread that allocates the objects. So, doing both at carefully chosen intervals can get you a good compromise between the execution speed of your allocation loop and the memory footprint.