Search code examples
memorycudaout-of-memoryallocationcusp-library

Memory use in cuda cusp linear solver


I am using cusp::bicgstab to solve a linear system Ax=b, in which A is a 3D Poisson on MxNxP mesh, x is unknowns, and b is the RHS. I have a K40m Tesla which has 12GB memory.

I tested with M=2000, N=2000, P=20 (80 millions unknowns), variable type is double; so the total memory used (for A, x, b, and others) is approximately 5.5GB. The code works fine.

Then I increased the value of M or N to 2500 (memory used is still far less than 12GB), the program encountered the following error:

terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'

what(): std::bad_alloc: out of memory
Aborted (core dumped)

I see that the error is "out of device memory". Therefore, I am wondering about the memory management in cusp library. Does it use about the same memory space for extra variables (as used for A,x,b) during the iterations to solve the system?

Below is my code:

#include <iostream>
#include <cuda.h>
#include <cuda_runtime_api.h>

#include <cusp/monitor.h>
#include <cusp/krylov/bicgstab.h>
#include <cusp/gallery/poisson.h>
#include <cusp/print.h>

// where to perform the computation
typedef cusp::device_memory MemorySpace;

// which floating point type to use
typedef double ValueType;

int main(int argc, char **argv)
{
    size_t avail, total;                // Available and Total memory count
    int N = 2500, M = 2000, P = 20;     // Dimension
    
    // create a matrix for a 3D Poisson problem on a MxNxP grid
    cusp::dia_matrix<int, ValueType, MemorySpace> A;
    cusp::gallery::poisson7pt(A, N, M, P);

    // allocate storage for solution (x) and right hand side (b)
    cusp::array1d<ValueType, MemorySpace> x(N*M*P, 0.0);
    cusp::array1d<ValueType, MemorySpace> b(N*M*P, 1.0);
    
    // set preconditioner (identity)
    cusp::identity_operator<ValueType, MemorySpace> ID(A.num_rows, A.num_rows);
    
    // Set stopping criteria:
    // ... iteration_limit    = 100
    // ... relative_tolerance = 1e-9
    // ... absolute_tolerance = 0
    cusp::default_monitor <ValueType> monitor(b, 100, 1e-9);

    // solve the linear system A x = b
    cusp::krylov::bicgstab(A, x, b, monitor, ID);

    // Get device memory usage
    cudaMemGetInfo( &avail, &total );
    size_t used = total - avail;
    std::cout << "Device memory used: " << used/(1024.*1024.*1024.) << " Gb " << std::endl;
    
    return 0;
}

Solution

  • You can read the source for the bicgstab solver yourself, but it looks like there are eight temporary arrays, each with the same number of entries as rows in your matrix. If I have read your code correctly, that means that you would need to have at least 8 * N * M * P * sizeof(double) bytes of free GPU memory on entry to the bicgstab call for the solver to run.