Tags: cuda, operating-system, virtualization, swapfile

How can I make use of swap space on disk when a modern GPU runs out of memory?


Post-Pascal Unified Memory (UM) can allocate more memory than the GPU physically has; pages are swapped in and out between GPU memory and host memory automatically.

So what happens when both GPU memory and host memory run out? How can I use the swap space on disk? Virtual memory swap space does not seem to work in the cudaMallocManaged case. Here is how I did the experiment:

  1. Create swap space: dd if=/dev/zero of=./swapfile bs=1G count=16, then mkswap and swapon.
  2. Create a host memory occupier that burns through 99% of host memory:
for (int i = 0; i < 8000; i++)
  memset(malloc(1 << 20), 1, 1 << 20);  // touch each 1 MB block so it is actually committed
  3. Create a GPU memory occupier with cudaMalloc, leaving 1 GB of GPU memory free.
  4. Run the official UVM demo, but with a 12 GB workload that exhausts all available GPU memory and host memory:
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__ void add(long long n, float *x, float *y) {
  long long index = blockIdx.x * blockDim.x + threadIdx.x;
  long long stride = blockDim.x * gridDim.x;
  for (long long i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void) {
  long long N = 6LL * (1 << 30) / sizeof(float);  // <<< changed: two 6 GB arrays, 12 GB total
  float *x, *y;

  // Allocate Unified Memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  // initialize x and y arrays on the host
  for (long long i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch kernel over all N elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (long long i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}

It just gets killed by the oom-killer and leaves the swap space untouched.

If a pure libc malloc runs the same kind of workload at 100 GB, you can see swap space usage growing.

Shouldn't UM/UVA be able to use GPU memory + host memory + swap space, just like ordinary virtual memory does?


Solution

  • There are 3 types of memory allocations that are accessible to GPU device code:

    • ordinary (e.g. cudaMalloc)
    • pinned (e.g. cudaHostAlloc)
    • managed (e.g. cudaMallocManaged)

    None of these will make use of, or have any bearing on, traditional Linux swap space (or the equivalent on Windows). The first is limited by available device memory, and the second two are limited by available host memory (or some other lower limit). All host-based allocations accessible to GPU device code must be resident in non-swappable memory; "swappable" here refers to the ordinary host virtual memory management system that may swap pages out to disk.
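
    As a concrete sketch (not part of the original answer; the 1 GB size and the minimal error handling are illustrative assumptions), the three allocation types look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t bytes = 1ULL << 30;  // 1 GB per allocation, chosen arbitrarily
  float *dev = nullptr, *pinned = nullptr, *managed = nullptr;

  // 1. ordinary device allocation: limited by free GPU memory
  cudaError_t e1 = cudaMalloc((void **)&dev, bytes);

  // 2. pinned host allocation: limited by host RAM; the pages are
  //    locked in physical memory, so they can never be swapped to disk
  cudaError_t e2 = cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

  // 3. managed allocation: migrates between device and host memory,
  //    but its host-resident pages are likewise not swappable
  cudaError_t e3 = cudaMallocManaged(&managed, bytes);

  printf("cudaMalloc:        %s\n", cudaGetErrorString(e1));
  printf("cudaHostAlloc:     %s\n", cudaGetErrorString(e2));
  printf("cudaMallocManaged: %s\n", cudaGetErrorString(e3));

  cudaFree(dev);
  cudaFreeHost(pinned);
  cudaFree(managed);
  return 0;
}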

    The only space that benefits from this form of swapping is pageable host memory, and pageable allocations are not directly accessible from CUDA device code.
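
    To show where swap can participate at all, here is a sketch of my own (not the answerer's code): a pageable malloc buffer, which the OS may swap to disk, is staged to the device with an explicit cudaMemcpy, because device code cannot dereference it directly.

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
  size_t bytes = 1ULL << 30;  // 1 GB, arbitrary for illustration

  // Pageable host allocation: the only kind of memory eligible for
  // OS swap, and exactly the kind device code cannot access directly.
  float *host = (float *)malloc(bytes);
  memset(host, 0, bytes);  // touch the pages so they are actually committed

  float *dev = nullptr;
  cudaMalloc((void **)&dev, bytes);

  // Swap-backed memory reaches the GPU only through explicit copies;
  // the driver stages pageable memory through an internal pinned buffer.
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

  // ... launch kernels on dev here ...

  cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

  cudaFree(dev);
  free(host);
  return 0;
}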