In my main.cpp I create some vectors on the host and then copy them to the device. I also create a cuBLAS handle because I want to use cuBLAS:
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include "cuda.h"   // declares gpu_blas_sum

#define N 3

int main() {
    float a[N], b[N], c[N];
    float *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    cudaMalloc( &dev_a, N * sizeof(float) );
    cudaMalloc( &dev_b, N * sizeof(float) );
    cudaMalloc( &dev_c, N * sizeof(float) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = i + 0.1;
        b[i] = i * i + 0.5;
        printf( "%f + %f \n", a[i], b[i] );
    }

    // copy the inputs to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice );

    // create the cublas handle and compute c = a + b on the device
    cublasHandle_t handle;
    cublasCreate(&handle);
    gpu_blas_sum(handle, dev_a, dev_b, dev_c, N);

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(float), cudaMemcpyDeviceToHost );

    // ... destroy the cublas handle and free the device memory
    return 0;
}
Then I have cuda.cu and cuda.h files that define gpu_blas_sum, which is called in the code above and is supposed to run on the device.
cuda.h
#include <cublas_v2.h>

void gpu_blas_sum(cublasHandle_t &handle, float *A, float *B, float *C, int n);
cuda.cu
#include "cuda.h"

void gpu_blas_sum(cublasHandle_t &handle, float *A, float *B, float *C, int n) {
    const float alf = 1;
    A[0] = 3;                                  // this is the line that segfaults
    cublasScopy(handle, n, A, 1, C, 1);        // C = A
    cublasSaxpy(handle, n, &alf, B, 1, C, 1);  // C = alf*B + C
}
The line A[0] = 3 in cuda.cu results in a segmentation fault. I guess that my function gpu_blas_sum is being treated as a host function.
How can I make it execute on the device, so that I can dereference device pointers and take advantage of GPU speed when I use cuBLAS functions?
Thanks for any help.
This is illegal:
A[0] = 3;
This is host code, but A is a device pointer. A basic CUDA rule is that host code is not allowed to dereference a device pointer, and device code is not allowed to dereference a host pointer. If you dereference a device pointer in host code, a segmentation fault is the likely outcome (just as if you dereferenced any other pointer that is meaningless in host code, such as a NULL pointer).
If you really want to do this specific operation, just as you have written, then a tedious but workable solution would be:
float my_val = 3;
cudaMemcpy(A, &my_val, sizeof(float), cudaMemcpyHostToDevice);
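The same pattern works in the other direction (a cudaMemcpy with cudaMemcpyDeviceToHost to read a device value back on the host). Also note that cublasScopy and cublasSaxpy already execute on the GPU even though they are called from host code; they only require that the data pointers refer to device memory. So, as a minimal sketch of how your function could look with the illegal dereference replaced by a copy (keeping everything else as you wrote it), gpu_blas_sum can remain a host function:

#include <cuda_runtime.h>
#include <cublas_v2.h>

// Host function: overwrite A[0] on the device via cudaMemcpy, then
// compute C = A + B entirely on the GPU with cuBLAS.
void gpu_blas_sum(cublasHandle_t &handle, float *A, float *B, float *C, int n) {
    const float alf = 1.0f;
    const float my_val = 3.0f;

    // legal replacement for the illegal host-side "A[0] = 3"
    cudaMemcpy(A, &my_val, sizeof(float), cudaMemcpyHostToDevice);

    cublasScopy(handle, n, A, 1, C, 1);        // C = A         (runs on the GPU)
    cublasSaxpy(handle, n, &alf, B, 1, C, 1);  // C = alf*B + C (runs on the GPU)
}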
If you want to move everything to the device, I suggest you study a CUDA sample code that calls cuBLAS functions from device code, such as simpleDevLibCUBLAS.
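Purely as an illustration of the pattern that sample demonstrates (the details below are my assumption, not a copy of the sample), calling cuBLAS from device code relies on the device-callable cuBLAS library, which needs a compute capability 3.5+ GPU, relocatable device code (-rdc=true), and linking against that library; recent CUDA toolkits no longer ship it. A rough sketch of such a kernel:

#include <cublas_v2.h>

// Rough sketch only (assumes the legacy device-callable cuBLAS library):
// the handle is created inside the kernel and the cuBLAS calls run as
// child work on the device.
__global__ void device_blas_sum(float *A, float *B, float *C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alf = 1.0f;
    A[0] = 3;                                  // legal here: device code dereferencing a device pointer
    cublasScopy(handle, n, A, 1, C, 1);        // C = A
    cublasSaxpy(handle, n, &alf, B, 1, C, 1);  // C = alf*B + C

    cublasDestroy(handle);
}

You would launch it from the host with something like device_blas_sum<<<1,1>>>(dev_a, dev_b, dev_c, N); but check the sample itself for the exact build and API requirements.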