Tags: python, dask, cupy

Dask GPU running out of memory with unified allocator


I asked a somewhat similar question the other day. I have tried asking on the Dask Slack and on their Discourse forum, but to no avail.

I am currently trying to create a large array in CPU memory, move chunks of it to the GPU to perform a multiplication, and then move the results back to the CPU. I keep getting a memory error, even for arrays of shape (512, 512, 1000).

I have searched the web, and some answers pointed out that the problem could be the memory allocator, which can be set so that memory is managed automatically (unified memory). However, I keep getting the memory error.

import cupy as cp
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import cudf 

if __name__ == '__main__':
    
    cluster = LocalCUDACluster('0', n_workers=1)
    client = Client(cluster)    
    client.run(cudf.set_allocator, "managed")


    shape = (512, 512, 1000)
    chunks = (100, 100, 1000)

    huge_array_gpu = da.ones_like(cp.array(()), shape=shape, chunks=chunks)
    array_sum = da.multiply(huge_array_gpu, 17).compute()
   

Am I overlooking something?


Solution

  • You are attempting to set the cuDF allocator, but you are only using CuPy to compute. Each library has its own allocator, which you need to set accordingly. The proper way to achieve what you are trying to do requires a few modifications: enable unified memory directly in LocalCUDACluster, and then set CuPy's allocator to use RMM (the RAPIDS Memory Manager, which cuDF uses under the hood).

    To achieve the above, you will need to import RMM, change how you start the cluster to add rmm_managed_memory=True, and set CuPy's allocator to use RMM. Note that cuDF's default allocator is already RMM, so when you set rmm_managed_memory=True in LocalCUDACluster, cuDF will implicitly use managed memory, unlike CuPy. (A quick sanity check that RMM really was set up is sketched after the complete code below.)

    Also note that you are calling compute() at the end. That call brings the data back to the client; in other words, it transfers a copy from the Dask GPU cluster back to the client's GPU. A more appropriate way of executing work in Dask is to use persist(), which computes the work but keeps the results on the cluster for further consumption. If you bring the data back to your client, then, because you are using GPU 0 only, the Dask GPU worker and the Dask client will compete for the same GPU's memory, which may eventually cause out-of-memory errors. What you can do in that case is also set the client to use managed memory with RMM, set CuPy's allocator accordingly, and only transfer back what you actually need (a small sketch of that follows the complete code below).

    The complete code should look like the following:

    import cupy as cp
    import numpy as np
    import dask.array as da
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client, wait
    import rmm
    
    if __name__ == '__main__':
    
        cluster = LocalCUDACluster('0', rmm_managed_memory=True)
        client = Client(cluster)
        client.run(cp.cuda.set_allocator, rmm.rmm_cupy_allocator)
    
        # Here we set RMM/CuPy memory allocator on the "current" process,
        # i.e., the Dask client.
        rmm.reinitialize(managed_memory=True)
        cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
    
        # Note: a much larger array than in the question (~63 GB of float64);
        # managed memory allows oversubscribing the GPU's physical memory.
        shape = (512, 512, 30000)
        chunks = (100, 100, 1000)
    
        huge_array_gpu = da.ones_like(cp.array(()), shape=shape, chunks=chunks)
        array_sum = da.multiply(huge_array_gpu, 17).persist()
        # `persist()` starts the computation asynchronously on the cluster, so
        # we `wait()` for the actual compute to finish.
        wait(array_sum)
    
        # Bring data back to client if necessary
        # array_sum.compute()
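
    As a follow-up to the compute() point above: if you do need something back on the client, it is usually better to reduce on the cluster and transfer only the small result, rather than copying the whole array. A minimal sketch, reusing array_sum from the code above (the sum() reduction is just an illustrative choice); these lines would go at the end of the same if __name__ == '__main__': block:

    # Reduce on the cluster, then transfer only the resulting scalar to the
    # client, instead of copying the entire ~63 GB array back with
    # `array_sum.compute()`.
    total = array_sum.sum().compute()
    print(total)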
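
    Separately, if you want to sanity-check that RMM was actually set up, rmm.is_initialized() reports whether RMM has been initialized in a given process (it does not tell you whether managed memory specifically is in use). You can run it on the workers through the client, and on the client process itself:

    # Returns {worker_address: True, ...} if RMM is set up on every worker.
    print(client.run(rmm.is_initialized))
    # The same check on the client process itself.
    print(rmm.is_initialized())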