c++ cuda gpu cusolver

Status: execution failed, when invoking cusolverDnDgeqrf from CUDA library


I am trying to perform a QR factorization on the GPU using the cuSOLVER library from CUDA.

I reduced my problem to the example below.

The steps are:

  1. I allocate memory and initialize a [5x3] matrix with 1s on the host
  2. I allocate memory and copy the matrix to the device
  3. I initialize the solver handle with cusolverDnCreate
  4. I determine the size of the required workspace with cusolverDnDgeqrf_bufferSize
  5. Finally, I try to do the QR factorization with cusolverDnDgeqrf

Unfortunately, the last command systematically fails by returning a CUSOLVER_STATUS_EXECUTION_FAILED (int value = 6) and I can't figure out what went wrong!

Here is the faulty code:

#include <iostream>
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
int main(void)
{

int N = 5, P = 3;

double *hostData;
cudaMallocHost((void **) &hostData, N * sizeof(double));
for (int i = 0; i < N * P; ++i)
    hostData[i] = 1.;

double *devData;
cudaMalloc((void**)&devData, N * sizeof(double));

cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);

cusolverStatus_t retVal;
cusolverDnHandle_t solverHandle;

retVal = cusolverDnCreate(&solverHandle);
std::cout << "Handler creation : " << retVal << std::endl;

double *devTau, *work;
int szWork;

cudaMalloc((void**)&devTau, P * sizeof(double));

retVal = cusolverDnDgeqrf_bufferSize(solverHandle, N, P, devData, N, &szWork); 
std::cout << "Work space sizing : " << retVal << std::endl;

cudaMalloc((void**)&work, szWork * sizeof(double));

int *devInfo;
cudaMalloc((void **)&devInfo, 1);

retVal = cusolverDnDgeqrf(solverHandle, N, P, devData, N, devTau, work, szWork, devInfo); //CUSOLVER_STATUS_EXECUTION_FAILED
std::cout << "QR factorization : " << retVal << std::endl;

int hDevInfo = 0;
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "Info device : " << hDevInfo << std::endl;

cudaFree(devInfo);
cudaFree(work);
cudaFree(devTau);
cudaFree(devData);
cudaFreeHost(hostData);

cudaDeviceReset();

}

If you see any obvious error in my code, please let me know! Many thanks.


Solution

  • Any time you are having trouble with CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck before asking for help.

    You may also want to be aware of the fact that a fully worked QR factorization example is given in the relevant CUDA/cusolver sample code and there is also sample code in the documentation.
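    As an illustration of what "proper error checking" can look like, here is a minimal sketch. The helper checkStatus and the macros CHECK_CUDA and CHECK_CUSOLVER are hypothetical names, not part of CUDA or cuSOLVER; only cudaSuccess and CUSOLVER_STATUS_SUCCESS come from the libraries. The sketch assumes that throwing an exception on the first failing call is acceptable for your program.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CUDA): throws when `status` differs
// from `success`, reporting the failing expression and its location.
template <typename Status>
void checkStatus(Status status, Status success,
                 const char* expr, const char* file, int line)
{
    if (status != success) {
        std::ostringstream msg;
        msg << expr << " failed with status " << static_cast<int>(status)
            << " at " << file << ":" << line;
        throw std::runtime_error(msg.str());
    }
}

// Hypothetical macros. They expand only where they are used, so the code
// using them must include <cuda_runtime_api.h> and <cusolverDn.h>.
#define CHECK_CUDA(call) \
    checkStatus((call), cudaSuccess, #call, __FILE__, __LINE__)
#define CHECK_CUSOLVER(call) \
    checkStatus((call), CUSOLVER_STATUS_SUCCESS, #call, __FILE__, __LINE__)

// Intended usage with calls from the question:
//   CHECK_CUDA(cudaMalloc((void**)&devInfo, sizeof(int)));
//   CHECK_CUSOLVER(cusolverDnCreate(&solverHandle));
```

    Wrapping every runtime API and cuSOLVER call this way reports the first failing call with its file and line, rather than leaving you with a bare status value of 6 at the end.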

    With proper error checking, you may have discovered:

    1. This is not correct:

      cudaMalloc((void **)&devInfo, 1);
      

      The second parameter is the size in bytes, so it should be sizeof(int), not 1. This mistake causes an error in a cudaMemcpyAsync operation internal to the cusolverDnDgeqrf call, which would show up in cuda-memcheck output.

    2. This is not correct:

      cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
      

      The order of the pointer parameters is destination first, followed by source. You have those parameters reversed, so this call returns a runtime API error that you would observe if you were doing proper error checking (it is also visible in cuda-memcheck output).

    Once you fix those errors, the cusolverDnDgeqrf call will return a zero status (no error). But we're not quite done yet (again, proper error checking would tell us so).

    3. In addition to the above errors, you have made some sizing errors. Your matrix is of size N*P, so it has N*P elements, and you are initializing that many elements here:

      for (int i = 0; i < N * P; ++i)
          hostData[i] = 1.;
      

      but you are not allocating for that many elements on the host here:

      cudaMallocHost((void **) &hostData, N * sizeof(double));
      

      or on the device here:

      cudaMalloc((void**)&devData, N * sizeof(double));
      

      and you are not transferring that many elements here:

      cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
      

      In the 3 cases above, changing N*sizeof(double) to N*P*sizeof(double) fixes those errors. The code then runs with no errors reported by cuda-memcheck and no errors returned from any of the API calls.
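
To make the sizing arithmetic concrete, here is a small sketch. The helper matrixBytes is a hypothetical name introduced for illustration. It assumes the column-major layout that cusolverDnDgeqrf expects, with the leading dimension equal to the row count as in the question. Assuming 8-byte doubles, the 5x3 matrix needs 120 bytes, whereas N * sizeof(double) provides only 40.

```cpp
#include <cstddef>

// Hypothetical helper: bytes occupied by a column-major matrix of doubles
// with leading dimension `lda` and `cols` columns.
constexpr std::size_t matrixBytes(int lda, int cols)
{
    return static_cast<std::size_t>(lda) * static_cast<std::size_t>(cols)
           * sizeof(double);
}

// Intended usage with N = 5, P = 3 from the question:
//   cudaMallocHost((void**)&hostData, matrixBytes(N, P));
//   cudaMalloc((void**)&devData, matrixBytes(N, P));
//   cudaMemcpy(devData, hostData, matrixBytes(N, P), cudaMemcpyHostToDevice);
```

Computing the byte count in one place keeps the host allocation, the device allocation, and the copy in agreement, which is exactly where the three sizing errors above crept in.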