
cublasSetVector() vs cudaMemcpy()


I am wondering if there is a difference between:

// cumalloc.c - Create a vector on the device
HOST float * cudamath_vector(const float * h_vector, const int m)
{
  float *d_vector = NULL;
  cudaError_t cudaStatus;
  cublasStatus_t cublasStatus;

  cudaStatus = cudaMalloc(&d_vector, sizeof(float) * m );

  /* Treat any failure as an error, not only cudaErrorMemoryAllocation */
  if(cudaStatus != cudaSuccess) {
    printf("ERROR: cumalloc.cu, cudamath_vector() : %s\n", cudaGetErrorString(cudaStatus));
    return NULL;
  }


  /*    THIS: */ cublasSetVector(m, sizeof(*d_vector), h_vector, 1, d_vector, 1);

  /* OR THAT: */ cudaMemcpy(d_vector, h_vector, sizeof(float) * m, cudaMemcpyHostToDevice);


  return d_vector;
}

cublasSetVector() has two arguments incx and incy and the documentation says:

The storage spacing between consecutive elements is given by incx for the source vector x and for the destination vector y.

In the NVIDIA forum someone said:

iona_me: "incx and incy are strides measured in floats."

So does this mean that for incx = incy = 1 the elements of a float[] are stored contiguously, sizeof(float) bytes apart, and that for incx = incy = 2 there would be a gap of sizeof(float) bytes between consecutive elements?

  • Apart from those two parameters and the cublasHandle, does cublasSetVector() do anything that cudaMemcpy() doesn't?
  • Would it be safe to pass a vector/matrix that was not created with the respective cublas*() function to other CUBLAS functions that manipulate it?

Solution

  • A comment by Massimiliano Fatica in an NVIDIA Forum thread confirms the statement in my comment above (more precisely, my comment came from recalling the post I linked to). In particular:

    cublasSetVector, cublasGetVector, cublasSetMatrix, cublasGetMatrix are thin wrappers around cudaMemcpy and cudaMemcpy2D. Therefore, no significant performance differences are expected between the two sets of copy functions.

    Accordingly, you can safely pass any array allocated with cudaMalloc as the device-side argument of cublasSetVector.

    Concerning the strides, perhaps there is a misprint in the guide (as of CUDA 6.0), which says that

    The storage spacing between consecutive elements is given by incx for the source vector x and for the destination vector y.

    but perhaps should be read as

    The storage spacing between consecutive elements is given by incx for the source vector x and incy for the destination vector y.
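The stride semantics described above can be sketched on the host alone. The function below is a hypothetical model, not the cuBLAS implementation: the name strided_copy_model is made up for illustration, it only handles positive strides, and it copies between two host buffers instead of host to device. It assumes, as the forum quote states, that incx and incy are counted in elements rather than bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only model of the documented cublasSetVector() semantics.
 * Copies n elements of elem_size bytes each from x to y.
 * incx / incy are strides counted in ELEMENTS (assumed positive):
 * element i is read from x[i * incx] and written to y[i * incy]. */
static void strided_copy_model(int n, size_t elem_size,
                               const void *x, int incx,
                               void *y, int incy)
{
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;

    for (int i = 0; i < n; ++i) {
        /* Byte offset = element index * stride * element size */
        memcpy(dst + (size_t)i * (size_t)incy * elem_size,
               src + (size_t)i * (size_t)incx * elem_size,
               elem_size);
    }
}
```

Under this model, copying 3 floats from {1, 2, 3, 4, 5, 6} with incx = 2 and incy = 1 yields {1, 3, 5}, and copying 3 floats with incx = 1 and incy = 2 into a zeroed buffer of 6 floats yields {1, 0, 2, 0, 3, 0}: one untouched float sits between consecutive written elements. With incx = incy = 1 the loop degenerates into a single contiguous copy of n * elem_size bytes, which is why the unit-stride cublasSetVector() call and the plain cudaMemcpy() call in the question move the same data.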