Tags: memory-management, pytorch, gpu, lazy-evaluation

Does PyTorch allocate GPU memory eagerly?


Consider the following script:

import torch

def unnecessary_compute():
    x = torch.randn(1000, 1000, device='cuda')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)

unnecessary_compute()

Running this script with PyTorch (1.11) generates the following output:

0 4000256
1 8000512
2 12000768
3 16001024
4 20971520

Given that PyTorch uses asynchronous computation and we never evaluated the contents of l or of any tensor that depends on l, why did PyTorch eagerly allocate GPU memory for the new tensors? Is there a way of creating these tensors in a fully lazy way (i.e., without triggering GPU memory allocation before it is actually required)?


Solution

  • torch.cuda.memory_allocated() returns the memory that has been allocated, not the memory that has actually been "used" yet (the runnable sketches at the end of this answer make that distinction concrete).

    In a typical GPU compute pipeline, you record operations in a queue along with whatever synchronization primitives your API offers. The GPU then dequeues and executes those operations, respecting the enqueued synchronization primitives. However, GPU memory allocation is not usually an operation that even goes on that queue. Rather, there is usually a separate, lower-level command the CPU can issue to the GPU (or its driver) to allocate memory, distinct from the mechanism used to record operations. This means that the memory a GPU operation needs has to be allocated before the operation is even enqueued; there is no "allocate memory" operation in the queue to synchronize with.

    Consider Vulkan as a simple example. Rendering operations are enqueued on a graphics queue, but memory is typically allocated via calls to vkAllocateMemory(), which does not accept a queue at all; it only takes the device handle and information about the allocation (size, memory type, etc.). From my understanding, the allocation is done "immediately" / synchronously: the memory is safe to use by the time the function call returns on the CPU.

    I don't know enough about GPUs to explain why this is the case, but I'm sure there's a good reason. And perhaps the limitations vary from device to device. But if I were to guess, memory allocation probably has to be a fairly centralized operation; it can't be done by just any core executing recorded operations on a queue. This would make sense, at least; the space of GPU memory is usually shared across cores.

    Let's apply this knowledge to answer your question: when you evaluate x**i (the l.append() itself is just a Python list operation), you are asking PyTorch to record a compute operation. That operation will need memory to hold its result, so PyTorch is most likely allocating that memory before it enqueues the operation. This explains the growth in memory_allocated() that you're seeing.

    However, this doesn't invalidate PyTorch's claims about asynchronous compute. The memory may be allocated synchronously, but it won't be populated with the result of the operation until the operation has been dequeued and completed by the GPU, which does happen asynchronously. The sketches below illustrate both points.
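
    To make the "allocated" vs. "used" distinction concrete, here is a minimal sketch (it assumes a CUDA device is available, and the exact byte counts depend on the caching allocator's block rounding). It also prints torch.cuda.memory_reserved(), which reports the blocks the caching allocator keeps cached even after a tensor is freed:

    import torch

    # Assumption: a CUDA device is available in this process.
    assert torch.cuda.is_available()

    print(torch.cuda.memory_allocated())  # 0 in a fresh process
    print(torch.cuda.memory_reserved())   # 0 in a fresh process

    x = torch.randn(1000, 1000, device='cuda')
    # The allocation is visible immediately, whether or not the GPU has
    # finished filling x with random numbers yet.
    print(torch.cuda.memory_allocated())  # ~4 MB (1000*1000 float32, rounded up)

    del x
    print(torch.cuda.memory_allocated())  # back to 0: the tensor was freed...
    print(torch.cuda.memory_reserved())   # ...but the allocator keeps the block cached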
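
    And here is a rough sketch of the "allocation is eager, compute is asynchronous" point (my own illustration, not anything from PyTorch's documentation; the matrix size is arbitrary and the timings will vary by GPU). The bytes for the result of x @ x show up in memory_allocated() essentially at launch time, while torch.cuda.synchronize() is still needed before the result is guaranteed to exist:

    import time
    import torch

    x = torch.randn(4096, 4096, device='cuda')
    _ = x @ x                  # warm-up so one-time cuBLAS setup doesn't pollute the timing
    torch.cuda.synchronize()

    before = torch.cuda.memory_allocated()
    t0 = time.perf_counter()
    y = x @ x                  # launching the kernel returns almost immediately...
    t_launch = time.perf_counter() - t0
    after_launch = torch.cuda.memory_allocated()
    torch.cuda.synchronize()   # ...but the result only exists once the GPU finishes
    t_done = time.perf_counter() - t0

    print("bytes allocated at launch:", after_launch - before)  # the full fp32 result (64 MiB)
    print(f"launch: {t_launch * 1e3:.3f} ms, finished: {t_done * 1e3:.3f} ms")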