In my main.cpp I create some vectors on the host and then copy them to the device. I also create a cuBLAS handle because I want to use cuBLAS:
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include "cuda.h"   // declares gpu_blas_sum

#define N 3

int main() {
    float a[N], b[N], c[N];
    float *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    cudaMalloc( &dev_a, N * sizeof(float) );
    cudaMalloc( &dev_b, N * sizeof(float) );
    cudaMalloc( &dev_c, N * sizeof(float) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = i + 0.1;
        b[i] = i * i + 0.5;
        printf( "%f + %f \n", a[i], b[i] );
    }

    // copy the inputs to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice );

    // create the cublas handle and compute c = a + b on the device
    cublasHandle_t handle;
    cublasCreate(&handle);
    gpu_blas_sum(handle, dev_a, dev_b, dev_c, N);

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(float), cudaMemcpyDeviceToHost );

    // ... destroy the cublas handle and free the device memory
    return 0;
}
Then I have cuda.cu and cuda.h files that define gpu_blas_sum, which is called in the code above and is supposed to run on the device.
cuda.h
#include <cublas_v2.h>

void gpu_blas_sum(cublasHandle_t &handle, float *A, float *B, float *C, int n);
cuda.cu
#include "cuda.h"

void gpu_blas_sum(cublasHandle_t &handle, float *A, float *B, float *C, int n) {
    const float alf = 1;
    A[0] = 3;                                  // this is the line that segfaults
    cublasScopy(handle, n, A, 1, C, 1);        // C = A
    cublasSaxpy(handle, n, &alf, B, 1, C, 1);  // C = alf*B + C
}
The line A[0] = 3 in cuda.cu results in a segmentation fault. I guess that my function gpu_blas_sum is being treated as a host function.
How can I make it execute on the device, so that I can dereference device pointers and take advantage of GPU speed when I use cuBLAS functions?
Thanks for any help.
This is illegal:
A[0] = 3;
This is host code, but A is a device pointer. A basic CUDA rule is that host code is not allowed to dereference a device pointer, and device code is not allowed to dereference a host pointer. If you dereference a device pointer in host code, a segmentation fault is the likely outcome (just as if you dereferenced any other pointer that is meaningless in host code, such as a NULL pointer).
If you really want to do this specific operation, just as you have written, then a tedious but workable solution would be:
float my_val = 3;
cudaMemcpy(A, &my_val, sizeof(float), cudaMemcpyHostToDevice);
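The same pattern works in the other direction (a cudaMemcpy with cudaMemcpyDeviceToHost to read a device value back on the host). Also note that cublasScopy and cublasSaxpy already execute on the GPU even though they are called from host code; they only require that the data pointers refer to device memory. So, as a minimal sketch of how your function could look with the illegal dereference replaced by a copy (keeping everything else as you wrote it), gpu_blas_sum can remain a host function:

#include <cuda_runtime.h>
#include <cublas_v2.h>

// Host function: overwrite A[0] on the device via cudaMemcpy, then
// compute C = A + B entirely on the GPU with cuBLAS.
void gpu_blas_sum(cublasHandle_t &handle, float *A, float *B, float *C, int n) {
    const float alf = 1.0f;
    const float my_val = 3.0f;

    // legal replacement for the illegal host-side "A[0] = 3"
    cudaMemcpy(A, &my_val, sizeof(float), cudaMemcpyHostToDevice);

    cublasScopy(handle, n, A, 1, C, 1);        // C = A         (runs on the GPU)
    cublasSaxpy(handle, n, &alf, B, 1, C, 1);  // C = alf*B + C (runs on the GPU)
}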
If you want to move everything to the device, I suggest you study a CUDA sample code that calls cuBLAS functions from device code, such as simpleDevLibCUBLAS.
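Purely as an illustration of the pattern that sample demonstrates (the details below are my assumption, not a copy of the sample), calling cuBLAS from device code relies on the device-callable cuBLAS library, which needs a compute capability 3.5+ GPU, relocatable device code (-rdc=true), and linking against that library; recent CUDA toolkits no longer ship it. A rough sketch of such a kernel:

#include <cublas_v2.h>

// Rough sketch only (assumes the legacy device-callable cuBLAS library):
// the handle is created inside the kernel and the cuBLAS calls run as
// child work on the device.
__global__ void device_blas_sum(float *A, float *B, float *C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alf = 1.0f;
    A[0] = 3;                                  // legal here: device code dereferencing a device pointer
    cublasScopy(handle, n, A, 1, C, 1);        // C = A
    cublasSaxpy(handle, n, &alf, B, 1, C, 1);  // C = alf*B + C

    cublasDestroy(handle);
}

You would launch it from the host with something like device_blas_sum<<<1,1>>>(dev_a, dev_b, dev_c, N); but check the sample itself for the exact build and API requirements.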