cuda nvidia nvcc nsight cublas

Upper Limit on Matrix Size for Multiplication using cublas gemm function (cublasSgemm)


This is the first time ever that I have not been able to get help from answers of previously posted questions.

I have been using cublasSgemm quite successfully for multiplying square matrices. Recently, however, I observed that once the number of rows or columns exceeds 269 (i.e. for 270 x 270 matrices and above), I begin to get "Memory Access Violations" when I debug with the Nsight CUDA Memory Checker enabled. If I do not enable the memory checker, there are no exceptions and the results are also correct.

The following is the exact error message:

Memory Checker detected 64 access violations

access violations on store (global memory)

Is it a limitation of my GPU or of the cublasSgemm function? What can I do to resolve this issue?

I am using CUDA 6.5 with MS Visual Studio 2012 on a Quadro FX 1800M (sm_12). The OS is MS Windows 7 64-bit.

I am including a stripped-down version of the code below.

#include <stdio.h>
#include <cuda_runtime.h> // runtime API: cudaMalloc, cudaFree, cudaDeviceSynchronize
#include <cublas_v2.h>

int main(int argc, char **argv)
{
const int m = 269; // for 1 - 269 there are no access violations
// but as soon as m >= 270 Memory Checker throws memory access violations
// Note: the results are correct even with these violations
float *X = new float[m*m];
float *Y = new float[m*m];
float *Z = new float[m*m];
float *devX, *devY, *devZ;
cublasHandle_t handle;
cudaError_t err;
cublasStatus_t err1;

//simple initialization
for(unsigned long i = 0; i < m*m; i++)
{
    X[i] = 1;
    Y[i] = 2;
}

err1 = cublasCreate(&handle);
if(err1 != CUBLAS_STATUS_SUCCESS)
  return 1;

// cudaMalloc returns a cudaError_t, so compare against cudaSuccess
err = cudaMalloc((void **)&devX, m*m*sizeof(*devX));
if(err != cudaSuccess)
  return 1;

err = cudaMalloc((void **)&devY, m*m*sizeof(*devY));
if(err != cudaSuccess)
  return 1;

err = cudaMalloc((void **)&devZ, m*m*sizeof(*devZ));
if(err != cudaSuccess)
  return 1;


err1 = cublasSetMatrix(m, m, sizeof(*X), X, m, devX, m);
if(err1 != CUBLAS_STATUS_SUCCESS)
  return 1;

err1 = cublasSetMatrix(m, m, sizeof(*Y), Y, m, devY, m);
if(err1 != CUBLAS_STATUS_SUCCESS)
  return 1;

////////////////////////////////////////////////////////////
printf("Reached sgemm without error\n");
const float alpha = 1.0f, beta = 0.0f;
// cuda memory checker detects access violations when m > 269
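err1 = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m, &alpha, devX, m, devY, m, &beta, devZ, m);
if(err1 != CUBLAS_STATUS_SUCCESS)
  return 1;
// wait for the kernel to finish so that launch/execution errors surface here
err = cudaDeviceSynchronize();
if(err != cudaSuccess)
  return 1;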
printf("reached after sgemm without error\n");
////////////////////////////////////////////////////////////

err1 = cublasGetMatrix(m, m, sizeof(*devZ), devZ, m, Z, m);
if(err1 != CUBLAS_STATUS_SUCCESS) // check the cuBLAS status, not the stale cudaError_t
  return 1;

// just printing a single element for brevity
printf("....%f....", Z[0]); 

cudaFree(devX);
cudaFree(devY);
cudaFree(devZ);
cublasDestroy(handle);
// release the host buffers allocated with new[]
delete[] X;
delete[] Y;
delete[] Z;
getchar();
return 0;
}
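The claim that the results are correct even with the violations can be checked against a plain CPU reference rather than by printing a single element. The following is a minimal host-only sketch, not part of the original program; the helper names `referenceSgemm` and `maxAbsDiff` are hypothetical, and it assumes the same column-major layout that cuBLAS uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference C = A * B for square m x m matrices stored column-major,
// i.e. element (row, col) lives at index row + col * m.
// Hypothetical helper, intended only for validating GPU output.
std::vector<float> referenceSgemm(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t m)
{
    assert(a.size() == m * m && b.size() == m * m);
    std::vector<float> c(m * m, 0.0f);
    for (std::size_t col = 0; col < m; ++col) {
        for (std::size_t k = 0; k < m; ++k) {
            const float bkc = b[k + col * m];
            for (std::size_t row = 0; row < m; ++row)
                c[row + col * m] += a[row + k * m] * bkc;
        }
    }
    return c;
}

// Largest absolute element-wise difference between two equally sized buffers.
float maxAbsDiff(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    return worst;
}
```

With X filled with 1 and Y filled with 2 as above, every element of the product equals 2*m, so for m = 270 each entry of Z should be 540. Copying Z into a std::vector<float> and passing it to maxAbsDiff together with the reference result would show whether any element was corrupted.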

EDITED

Update: The result is the same even after disabling TDR.

EDITED AGAIN

I compiled and ran the cuBLAS sample downloaded from:

https://people.maths.ox.ac.uk/gilesm/cuda/prac5/simpleCUBLAS.cpp

and again, for N > 500, I get the same error as before.

If the CUDA Memory Checker is not enabled then, as before, this program runs to completion successfully and displays the "test passed" message.

Actually, the access violations begin at N = 350, but at that point they are unpredictable, i.e. they occur on some runs and not on others. For N > 500 they always occur.

I used cudaDeviceGetLimit(&heap_size, cudaLimitMallocHeapSize); and got a heap_size of 3435973836 bytes. So, presumably, this isn't the issue either!
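The reported heap size deserves a second look. 3435973836 is exactly 0xCCCCCCCC, the pattern that MSVC debug builds write into uninitialized stack variables. That suggests, without proving it, that cudaDeviceGetLimit returned an error and never wrote to heap_size; checking its return value against cudaSuccess would settle the question. Note also that cudaLimitMallocHeapSize governs the heap used by in-kernel malloc, not memory obtained through cudaMalloc. A small host-only sketch of the pattern check follows; the helper name `looksLikeMsvcUninitialized` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when the low 32 bits of a value match the
// 0xCCCCCCCC fill pattern used by MSVC debug builds for uninitialized
// stack memory.
constexpr bool looksLikeMsvcUninitialized(std::size_t value)
{
    return static_cast<std::uint32_t>(value) == 0xCCCCCCCCu;
}

// Compile-time confirmation that the value quoted above is that pattern.
static_assert(0xCCCCCCCCu == 3435973836u, "decimal form of 0xCCCCCCCC");
```

A value flagged by this check is best treated as "never written" rather than as a real limit.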

EDITED: I have now run the sample project code at 'C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\7_CUDALibraries\simpleCUBLAS'. No luck!

EDITED: Could using a single GPU be the reason?


Solution

  • Even though the following is not a 'complete' answer, I have decided to share my observation.

    I finally switched back to using CUDA on Linux, using cuda-gdb with memcheck on. Although running Linux in level 1 is not fun, all the uncertainty associated with using Windows is removed. The above code now runs even for N = 15000.

    In short, the cuBLAS gemm functions are limited only by hardware capability.
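    For a square SGEMM, "hardware capability" in practice mostly means device memory, since the three matrices must all be resident on the GPU. The sketch below estimates that footprint; it is an illustration under the assumption of 4-byte floats and no extra workspace, and the helper names `sgemmFootprintBytes` and `fitsInDeviceMemory` are hypothetical.

```cpp
#include <cstdint>

// Bytes needed on the device for the three n x n single-precision
// matrices (A, B and C) touched by one square SGEMM call.
// Assumes no additional workspace is allocated.
constexpr std::uint64_t sgemmFootprintBytes(std::uint64_t n)
{
    return 3u * n * n * sizeof(float);
}

// True when the three matrices fit into the given number of free bytes,
// e.g. the "free" value reported by cudaMemGetInfo.
constexpr bool fitsInDeviceMemory(std::uint64_t n, std::uint64_t freeBytes)
{
    return sgemmFootprintBytes(n) <= freeBytes;
}
```

    For N = 270 this gives 874800 bytes, under 1 MB, so matrix size alone cannot explain violations at that dimension. For N = 15000 it gives 2700000000 bytes, roughly 2.7 GB.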