Tags: cuda, cublas

memset in CUBLAS gemm is always launched in default stream


I noticed that each call to the cublasSgemm function from the host results in 3 kernel invocations: memset, scal_kernel, and the gemm kernel itself (e.g. sgemm_large). This happens even if I use constant alpha/beta allocated in device memory. While the overhead of memset and scal_kernel is relatively small, the problem is that memset is always launched in the default stream, which causes unnecessary synchronization.

The code:

#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <cassert>

// Scalars kept in device memory (used with pointer mode DEVICE)
__constant__ __device__ float alpha = 1;
__constant__ __device__ float beta = 1;

int main()
{
    // ... handle creation (cbh) and memory allocation of x, y, mat skipped ...
    float* px   = thrust::raw_pointer_cast(x.data());
    float* py   = thrust::raw_pointer_cast(y.data());
    float* pmat = thrust::raw_pointer_cast(mat.data());
    for (int iter = 0; iter < 3; ++iter)
    {
        cbstatus = cublasSgemm(cbh, CUBLAS_OP_N, CUBLAS_OP_N,
                               crow, ccol, cshared,
                               &alpha, px, crow, py, cshared,
                               &beta, pmat, crow);
        assert(CUBLAS_STATUS_SUCCESS == cbstatus);
    }
}

This is what I see in the profiler:

[Profiler screenshot: the memset before each gemm call runs in the default stream]

The question: is there a way to avoid the memset, or to make it run in the stream assigned to the CUBLAS handle? One idea is to use dynamic parallelism (DP) and call the device version of the gemm function, but that only works on compute capability 3.5 and higher.
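
For reference, the stream binding mentioned above looks like this (a minimal sketch; the handle and stream setup here is illustrative rather than taken from the snippet):

cublasHandle_t cbh;
cublasCreate(&cbh);
cudaStream_t stream;
cudaStreamCreate(&stream);
// Every CUBLAS kernel launched through cbh should now go to `stream`,
// yet the profiler shows the internal memset still landing in stream 0.
cublasSetStream(cbh, stream);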


Solution

  • There was a bug in CUBLAS 5.5 where a cudaMemset was used instead of cudaMemsetAsync in the specialized path taken when k >> m,n.

    It is fixed in the CUBLAS 6.0 RC, which you can get access to if you are a registered developer.

    By the way, I wonder why you use __constant__ __device__ for alpha,beta. Are you using pointerMode = DEVICE?

    If not, you could simply keep alpha,beta on the host.
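
    For completeness, a minimal sketch of that host-pointer alternative (assuming the default CUBLAS_POINTER_MODE_HOST; the variable names reuse the question's snippet):

        float alpha = 1.0f, beta = 1.0f;  // plain host scalars
        // With pointer mode HOST (the default), CUBLAS dereferences the
        // scalar arguments on the host at call time, so no device-side
        // copies of alpha/beta are needed.
        cublasSetPointerMode(cbh, CUBLAS_POINTER_MODE_HOST);
        cublasSgemm(cbh, CUBLAS_OP_N, CUBLAS_OP_N,
                    crow, ccol, cshared,
                    &alpha, px, crow, py, cshared,
                    &beta, pmat, crow);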