Tags: c++, cuda, thrust, cublas

Mixing Thrust and cuBLAS: unexpected results in output


I love the Thrust library, especially how nicely it hides the complexity of cudaMalloc, cudaFree, etc.

I want to sum all the columns of a matrix, so I used cuBLAS's cublasSgemv to multiply my matrix by a vector of ones. Here is my code:

#include <vector>
#include <iostream>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/device_malloc.h>

void sEarColSum(std::vector<float>& inMatrix, int colSize)
{
    cublasHandle_t handle; // cuBLAS context
    float al = 1.0f;  // alpha = 1
    float bet = 1.0f; // beta = 1
    int rowSize = inMatrix.size() / colSize;

    float *devOutputPtr = thrust::raw_pointer_cast(thrust::device_malloc<float>(colSize));

    thrust::device_vector<float> deviceT2DMatrix(inMatrix.begin(), inMatrix.end());
    float* device2DMatrixPtr = thrust::raw_pointer_cast(deviceT2DMatrix.data());

    thrust::device_vector<float> deviceVector(rowSize, 1.0f);
    float* deviceVecPtr = thrust::raw_pointer_cast(deviceVector.data());

    cublasCreate(&handle);
    cublasSgemv(handle, CUBLAS_OP_N, colSize, rowSize, &al, device2DMatrixPtr, colSize, deviceVecPtr, 1, &bet, devOutputPtr, 1);

    std::vector<float> outputVec(colSize);
    cudaMemcpy(outputVec.data(), devOutputPtr, outputVec.size() * sizeof(float), cudaMemcpyDeviceToHost);

    for (auto elem : outputVec)
        std::cout << elem << std::endl;
}



int main(void)
{
    std::vector<float> temp(100, 1); // a vector of 100 elements, each equal to 1
    sEarColSum(temp, 10); // the matrix has 10 columns and 100/10 = 10 rows,
                          // so I expect an output vector with 10 elements, each with the value 10.
}

Unfortunately, the results are just garbage. I was expecting a vector of ten elements, each with the value ten, but instead what I get is:

30
30
-2.80392e+036
30
30
-4.95176e+029
30
6.64319e+016
-3.72391e+037
30

Am I missing anything? Where did my code go wrong?
And secondly, is there any way to inspect, for example, "float* device2DMatrixPtr" with a debugger? Visual Studio shows its address, but since it is in GPU memory it does not show the data at that address.


Solution

  • The cuBLAS function gemv performs a matrix-vector product:

    y = alpha*A*x + beta*y
    

    The y in the equation above is represented by your devOutputPtr, which you are allocating like this:

    float *devOutputPtr = thrust::raw_pointer_cast(thrust::device_malloc<float>(colSize));
    

    Ordinary thrust allocations like this:

    thrust::device_vector<float> my_vec...
    

    will allocate and initialize the storage, but thrust::device_malloc only allocates the storage; it does not initialize it.

    Therefore your y "vector" initially contains garbage. If you had set your beta to zero, then it would not matter. But since your beta is set to 1, the contents of this uninitialized area are added to your result vector.

    If you set

    float bet = 0.0f;
    

    I think you'll get the expected results (I do, with that change); a minimal corrected sketch is included at the end of this answer.

    Regarding this question:

    And secondly, is there any way to inspect, for example, "float* device2DMatrixPtr" with a debugger?

    You can just print the deviceT2DMatrix values out using e.g. printf or std::cout. Thrust will copy the values device->host "under the hood" for you, to facilitate that. If you want to access the device copies in the debugger, use the device-debug capability of Nsight Visual Studio Edition on Windows, or Nsight Eclipse Edition or cuda-gdb on Linux.
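
    For example, here is a minimal sketch of printing a device_vector from host code. This is my illustration, not code from the question; it uses a stand-in device_vector in place of your deviceT2DMatrix:

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <iostream>
    #include <iterator>

    int main()
    {
        thrust::device_vector<float> deviceT2DMatrix(100, 1.0f); // stand-in for your matrix data

        // Print every element; Thrust performs the device->host copies under the hood.
        thrust::copy(deviceT2DMatrix.begin(), deviceT2DMatrix.end(),
                     std::ostream_iterator<float>(std::cout, " "));
        std::cout << std::endl;

        // Reading a single element through operator[] also triggers a device->host copy.
        float first = deviceT2DMatrix[0];
        std::cout << "first element: " << first << std::endl;
        return 0;
    }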
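
    Finally, the corrected sketch referenced above, based on your original function with beta set to zero. My own additions beyond the question's code are the cublasDestroy/thrust::device_free cleanup and the small main; alternatively, allocating the output as thrust::device_vector<float>(colSize) would zero-initialize y and give correct results even with beta = 1.

    #include <vector>
    #include <iostream>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <thrust/device_vector.h>
    #include <thrust/device_malloc.h>
    #include <thrust/device_free.h>

    void sEarColSum(std::vector<float>& inMatrix, int colSize)
    {
        cublasHandle_t handle;                  // cuBLAS context
        float al = 1.0f;                        // alpha = 1
        float bet = 0.0f;                       // beta = 0: the uninitialized y is never read
        int rowSize = inMatrix.size() / colSize;

        // Keep the device_ptr so the allocation can be freed afterwards.
        thrust::device_ptr<float> devOutput = thrust::device_malloc<float>(colSize);
        float* devOutputPtr = thrust::raw_pointer_cast(devOutput);

        thrust::device_vector<float> deviceT2DMatrix(inMatrix.begin(), inMatrix.end());
        float* device2DMatrixPtr = thrust::raw_pointer_cast(deviceT2DMatrix.data());

        thrust::device_vector<float> deviceVector(rowSize, 1.0f);
        float* deviceVecPtr = thrust::raw_pointer_cast(deviceVector.data());

        cublasCreate(&handle);
        cublasSgemv(handle, CUBLAS_OP_N, colSize, rowSize, &al, device2DMatrixPtr,
                    colSize, deviceVecPtr, 1, &bet, devOutputPtr, 1);

        std::vector<float> outputVec(colSize);
        cudaMemcpy(outputVec.data(), devOutputPtr, outputVec.size() * sizeof(float),
                   cudaMemcpyDeviceToHost);

        for (auto elem : outputVec)
            std::cout << elem << std::endl;

        cublasDestroy(handle);           // cleanup, unrelated to the wrong results
        thrust::device_free(devOutput);  // cleanup, unrelated to the wrong results
    }

    int main()
    {
        std::vector<float> temp(100, 1.0f); // 10 x 10 matrix of ones
        sEarColSum(temp, 10);               // prints ten values of 10
        return 0;
    }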