CUDA & cuBLAS: calling a __global__ function after using cuBLAS

I am writing a program that involves matrix-vector multiplication and least-squares solving, all done with the cuBLAS and CULA APIs. The program iterates many times, and in each step I must set every element of one particular row of a matrix to zero.

I tried copying the entire matrix (50*1000 or larger) to the CPU, setting one row to zero, and copying the matrix back, but that is too time-consuming because the program iterates 10 times or more. So I decided to write a kernel function.

The kernel looks like this:

// A is the matrix in column-major order; index is the row to set to zero.
// N is defined elsewhere; it must equal the number of rows for ind % N to give the row.
__global__ void Setzero(float* A, int index)
{
    int ind = blockDim.x * blockIdx.x + threadIdx.x;
    // The matrix holds 50000 elements in total.
    if (((ind % N) == index) && (ind < 50000))
    {
        A[ind] = 0.0f;
    }
}

The problem appears when I do this (using cuBLAS before launching the kernel):

cudaMalloc((void**)&A_Gpu_trans,sizeof(float)*50000);
cudaMemcpy(A_Gpu_trans,A_trans,sizeof(float)*M*N,cudaMemcpyHostToDevice);
cublasSgemv_v2(handle,CUBLAS_OP_N,1000,50,&al,A_Gpu_trans,1000,err_gpu,1,&beta,product,1);
dim3 dimBlock(16,1);
dim3 dimGrid((50000-1)/16+1,1);
Setzero<<<dimGrid,dimBlock>>>(A_Gpu_trans,Index);

It returns the error:

a __host__ function("Setzero") redeclared with __global__.

and another error:

MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "D:\Program Files\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "C:\Users\Administrator\documents\visual studio 2010\Projects\OOmp\OOmp\kernel.cu"" exited with code 2.

Strangely, when I use only the cuBLAS and CULA APIs, I get the right answer.

Solution

Although you have not shown it in your question, you have clearly got another host function called Setzero somewhere in your code. The simple solution is to rename the kernel to something else.

The underlying reason the CUDA toolchain emits this error is that the Setzero<<< >>> kernel invocation syntax of the runtime API causes the CUDA front end to create a host function with the same name as the kernel and a matching argument list, and to replace the kernel launch with a call to that function. This host function contains the necessary API calls to launch the kernel. By having another host function with the same name as the kernel, you defeat this process and cause the compilation error you see.
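As a rough illustration of the rename, here is a minimal, self-contained sketch. The names setRowZeroKernel and zeroRowDemo are hypothetical, chosen only so that nothing on the host shares the kernel's name. The one-thread-per-column indexing is my own substitution, not the code from the question, and the sizes (1000 rows, 50 columns, row 7) are assumptions taken from the cublasSgemv_v2 call for the sake of the demo.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Shape of the conflict (kept as comments so this file still compiles);
// a pair like this would be expected to trigger the reported error:
//   void Setzero(float* A, int index);             // host function somewhere
//   __global__ void Setzero(float* A, int index);  // kernel with the same name

// Hypothetical kernel name: any name not used by a host function will do.
__global__ void setRowZeroKernel(float* A, int rows, int cols, int row)
{
    // One thread per column. In a column-major matrix with leading
    // dimension `rows`, element (row, col) lives at row + col * rows.
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (col < cols)
        A[row + col * rows] = 0.0f;
}

// Hypothetical helper: fills a rows x cols matrix with ones, zeroes one row
// on the GPU, and returns the number of wrong elements (0 means success,
// -1 means a CUDA call failed).
int zeroRowDemo(int rows, int cols, int row)
{
    std::vector<float> host(static_cast<size_t>(rows) * cols, 1.0f);
    size_t bytes = host.size() * sizeof(float);

    float* dev = 0;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess)
        return -1;
    cudaError_t err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 64;
        int blocks = (cols + threads - 1) / threads;
        setRowZeroKernel<<<blocks, threads>>>(dev, rows, cols, row);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    // Only the chosen row should be zero; everything else should still be one.
    int bad = 0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r) {
            float expected = (r == row) ? 0.0f : 1.0f;
            if (host[r + c * rows] != expected)
                ++bad;
        }
    return bad;
}

int main()
{
    int bad = zeroRowDemo(1000, 50, 7);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Launching one thread per column rather than one per element keeps the grid small (50 threads instead of 50000 here), but the rename alone is what addresses the compile error.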