
How do I force cupy to free all gpu memory after going out of scope?


I have a memory-intensive GPU-based model (CUDA C++ linked with Cython) that requires a substantial preprocessing step before it runs. Until now the preprocessing has been done on the CPU, with the results then passed to the GPU for the intensive computation, but the preprocessing runtime has blown out substantially and I am looking to speed it up.

Because of the model's huge GPU memory requirements, what I am trying to do is run the bulk (99%) of the preprocessing in CuPy, temporarily bring the results back to the CPU, completely free all GPU allocations made by CuPy, and then jump into the Cython/C++ code (loading the results that I swapped out to the CPU).

To illustrate the point: I have about 500 MB free on each of the 8 GPUs (a DGX A100) when running the model as-is, so it is simply not possible for me to have any residual CuPy allocations when proceeding with the actual model.

What I have tried is putting the preprocessing stage in a sub-function, which should, in theory, go out of scope and get cleaned up after finishing. This does not seem to be happening, however:

testcupy.py

import numpy as np
import testfunction
out = testfunction.run()
import gc
gc.collect()
#I want all gpu memory to be entirely cleaned up at this stage, out is a numpy array, 
#I have no need for any more gpu allocations
input()

testfunction.py

import cupy as cp
def run():
    X = cp.random.rand(10000000,300)
    y = cp.random.randn(300)
    out = X.dot(y)
    out_hst = cp.asnumpy(out)
    del X
    del y
    del out
    cp._default_memory_pool.free_all_blocks()
    #without these 'del' lines the residual memory is circa 40GB
    
    #I would expect all cupy arrays to go out of scope here & get cleaned up
    #(or at least be available for cleanup by GC)
    return out_hst

Without the del lines and the free_all_blocks() call, you get this:

nvidia-smi -i 0
Sat Apr 15 21:04:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    68W / 400W |  41671MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

(40GB! of outstanding allocations)

With the del lines, you get this:

nvidia-smi -i 0
Sat Apr 15 21:08:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0    84W / 400W |    799MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

A substantial improvement, but the memory still has not been entirely cleaned up. I would expect the CuPy footprint to be completely gone (0-2 MB used), but it seems to be holding on to a substantial allocation.

It's also worth pointing out that in more complex scripts CuPy occasionally calls cuBLAS or uses scratch space in the background, so it's not always possible to call del on every piece of memory being used. In practice I've found tens of gigabytes left hanging after leaving the scope of the CuPy source file, and gc is unable to clean them up.

Can anyone recommend a way of completely dropping all CuPy allocations at the end of the function (assuming it's up to me to bring any results I need back to the CPU before deleting the GPU copies)?


Solution

  • You are complaining that it is "hard to release memory" when running under CuPy.

    https://docs.cupy.dev/en/stable/user_guide/memory.html

    Attention:

    you may notice memory not being freed even after the array instance goes out of scope. This is an expected behavior, as the default memory pool “caches” the allocated memory blocks.

    Here is one approach:

    • run a coordinating parent process
    • fork a child process that does the preprocessing on the GPU, serializes the result to the filesystem, and exits
    • fork another child (or continue in the parent) that deserializes the result and carries on

    The idea is that the memory-pool cache will not survive the death of the worker process; a minimal sketch of this pattern follows.
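
    A rough sketch of that pattern, reusing the preprocessing from testfunction.py above and handing the result back through a multiprocessing queue instead of the filesystem (purely for brevity; writing to disk works the same way):

    import multiprocessing as mp

    def _gpu_preprocess(queue):
        # All CuPy allocations, including the memory-pool cache, live only in
        # this child process and are released when the process exits.
        import cupy as cp
        X = cp.random.rand(10000000, 300)
        y = cp.random.randn(300)
        queue.put(cp.asnumpy(X.dot(y)))

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")  # spawn avoids inheriting a live CUDA context
        queue = ctx.Queue()
        child = ctx.Process(target=_gpu_preprocess, args=(queue,))
        child.start()
        out = queue.get()  # numpy array on the host
        child.join()
        # The GPU now holds none of the child's allocations;
        # hand `out` to the cython/CUDA C++ model.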


    You didn't mention whether your code uses FFTs.

    If so, setting the FFT PlanCache size to 0 may help, as it disables plan caching; for example:
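
    This uses the plan-cache API from the CuPy FFT documentation; the cache is per device, so set it on each GPU you touch:

    import cupy as cp

    # Disable the FFT plan cache on the current device so cached plans
    # do not keep their GPU work areas alive between calls.
    cache = cp.fft.config.get_plan_cache()
    cache.set_size(0)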


    Suppose that an allocation of 1 GiB was acceptable in your use case. The documentation suggests

    $ export CUPY_GPU_MEMORY_LIMIT="1073741824"
    

    to accomplish that.
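
    The same cap can also be set from Python on the default memory pool (equivalent to the environment variable for the current device):

    import cupy as cp

    # Limit the default device memory pool to 1 GiB.
    cp.get_default_memory_pool().set_limit(size=1024**3)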

    The low-level docs describe additional knobs to adjust.

    For example, we see this advice:

    If you want to disable memory pool, please use the following code. cupy.cuda.set_allocator(None)

    The set_pinned_memory_allocator function works similarly.
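
    Putting the two together (a sketch; with the pools disabled, every allocation goes straight to the CUDA driver, which is slower but leaves nothing cached):

    import cupy as cp

    # Disable the device memory pool: arrays are allocated with cudaMalloc
    # directly and freed as soon as they are garbage collected.
    cp.cuda.set_allocator(None)

    # Disable the pinned (page-locked) host memory pool in the same way.
    cp.cuda.set_pinned_memory_allocator(None)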

    Other management functions are available on the pools as well, for example:
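
    A short illustration of the inspection and cleanup calls from the memory-management docs:

    import cupy as cp

    mempool = cp.get_default_memory_pool()
    pinned_mempool = cp.get_default_pinned_memory_pool()

    print(mempool.used_bytes())   # bytes held by live arrays
    print(mempool.total_bytes())  # bytes held by the pool (in use + cached)

    mempool.free_all_blocks()         # return cached device memory to the driver
    pinned_mempool.free_all_blocks()  # return cached pinned host memory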