Post-Pascal UM can allocate more memory than the GPU has, automatically swapping pages between GPU memory and host memory. So what happens if both GPU memory and host memory run out? Can I use the swap space on disk? Virtual-memory swap space does not seem to work in the cudaMallocManaged case. Here is how I did the experiment:
1. Add 16 GB of swap space: dd if=/dev/zero of=./swapfile bs=1G count=16, then mkswap and swapon.
2. Consume host memory: for (i = 0; i < 8000; i++) malloc(1<<20);
3. Consume GPU memory with cudaMalloc, leaving 1 GB of GPU memory free (a rough sketch of steps 2 and 3 follows the program).

Then I ran the following UM program:

#include <iostream>
#include <math.h>

// CUDA kernel to add the elements of two arrays (grid-stride loop)
__global__ void add(long long n, float *x, float *y) {
  long long index = blockIdx.x * blockDim.x + threadIdx.x;
  long long stride = blockDim.x * gridDim.x;
  for (long long i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void) {
  long long N = 6LL * (1 << 30) / sizeof(float); // 6 GB of floats per array (12 GB total) <<<<<<<<<<<
  float *x, *y;

  // Allocate Unified Memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  // Initialize x and y arrays on the host
  for (long long i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch the kernel over all N elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (long long i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
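Roughly, steps 2 and 3 look like this (a simplified sketch rather than the exact code I ran; the memset that forces physical backing, the 256 MB chunk size, and the final pause are illustrative details, while the 8000 x 1 MB count and the ~1 GB headroom come from the steps above):

// Sketch of steps 2 and 3: create host and GPU memory pressure, then pause
// so the allocations stay alive while the UM program runs in another shell.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(void) {
  // Step 2: consume roughly 8 GB of host memory (8000 x 1 MB).
  // memset forces the pages to be physically backed; an untouched malloc
  // would not create real memory pressure.
  for (int i = 0; i < 8000; i++) {
    void *p = malloc(1 << 20);
    if (p) memset(p, 1, 1 << 20);
  }

  // Step 3: consume GPU memory with cudaMalloc until about 1 GB is left free.
  size_t freeB = 0, totalB = 0;
  cudaMemGetInfo(&freeB, &totalB);
  const size_t headroom = 1ULL << 30;   // leave ~1 GB of GPU memory free
  const size_t chunk = 256ULL << 20;    // allocate in 256 MB chunks
  while (freeB > headroom + chunk) {
    void *d = nullptr;
    if (cudaMalloc(&d, chunk) != cudaSuccess) break;
    cudaMemGetInfo(&freeB, &totalB);
  }
  printf("GPU memory left free: %zu MB\n", freeB >> 20);

  getchar(); // hold the allocations until a key is pressed
  return 0;
}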
Running the UM program under this setup, it just gets killed by the OOM killer and the swap space is left untouched. If instead I run a pure libc malloc test with a 100 GB workload, I can see the swap space usage growing (a rough sketch of that test is below).
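The pure-malloc test looks roughly like this (a sketch; the 1 GB chunk size and the page-touching memset are illustrative details, and the ~100 GB total is the workload size described above):

// Plain libc malloc workload: once host RAM fills up, the kernel starts
// paging these ordinary pageable allocations out to the swap file, which is
// visible in `free -h` or `swapon --show` while this runs.
#include <cstdlib>
#include <cstring>

int main(void) {
  const size_t chunk = 1ULL << 30;      // 1 GB per allocation
  for (int i = 0; i < 100; i++) {       // ~100 GB in total
    char *p = (char *)malloc(chunk);
    if (!p) break;                      // stop if even virtual space runs out
    memset(p, 1, chunk);                // touch every page so it is backed
  }
  return 0;
}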
Shouldn't UM/UVA be able to use GPU memory + host memory + swap space, just like ordinary virtual memory?
There are 3 types of memory allocations that are accessible to GPU device code:

1. ordinary device memory (cudaMalloc)
2. pinned host memory (cudaHostAlloc)
3. managed memory (cudaMallocManaged)

None of these will make use of or have any bearing on traditional Linux swap space (or the equivalent on Windows). The first one is limited by available device memory, and the second two are limited by available host memory (or some other lower limit). All host-based allocations accessible to GPU device code must be resident in non-swappable memory, using "swappable" here to refer to the ordinary host virtual memory management system that may swap pages out to disk.
The only memory that benefits from this form of swapping is ordinary host pageable memory, and that is not directly accessible from CUDA device code.
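To make the distinction concrete, here is a minimal sketch of the three device-accessible allocation types next to an ordinary pageable allocation (the sizes are arbitrary and error checking is omitted):

#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
  const size_t bytes = 256ULL << 20;  // 256 MB, arbitrary

  // 1. Device memory: lives in GPU DRAM, limited by available device memory.
  float *d_buf = nullptr;
  cudaMalloc((void **)&d_buf, bytes);

  // 2. Pinned (page-locked) host memory: resident in host RAM and never
  //    swapped out by the OS; accessible to device code.
  float *h_pinned = nullptr;
  cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocMapped);

  // 3. Managed memory: migrates between GPU and host RAM on demand, limited
  //    by available host memory (oversubscribing the GPU on post-Pascal),
  //    but not backed by disk swap.
  float *managed = nullptr;
  cudaMallocManaged((void **)&managed, bytes);

  // Ordinary pageable host memory: the only kind the OS may page out to the
  // swap file, and not directly accessible from CUDA device code.
  float *h_pageable = (float *)malloc(bytes);

  free(h_pageable);
  cudaFree(managed);
  cudaFreeHost(h_pinned);
  cudaFree(d_buf);
  return 0;
}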