I am getting started with CUDA programming and I have a question about the kernel coding part. Below is the code I was trying out.
I want it to print the numbers 1-64 using 8 blocks of 8 threads each, to confirm that the program really is using 8 blocks of 8 threads.
The problem is that my output is only a single value, which is impossibly large and different on every run.
#include <stdio.h>

__global__
void start(int *a){
    *a = blockIdx.x*threadIdx.x*blockDim.x;;
}

int main(){
    int a;
    int *d_a;
    int size = 64*sizeof(int);
    cudaMalloc((void**)&d_a,size);
    cudaMemcpy(d_a,&a,size, cudaMemcpyHostToDevice);
    start<<<8,8>>>(d_a);
    cudaMemcpy(&a,d_a,size,cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    printf("%d\n",a);
    return 0;
}
EDIT: Alright, this is going to sound very dumb, but how do I check whether the kernel was actually run on the GPU? I suspect the kernel code isn't being executed at all, maybe because the GPU is off or something. I am connecting through PuTTY, so I don't have physical access to the machine.
Two problems, both on the same line of code:
*a = blockIdx.x*threadIdx.x*blockDim.x;;
1. All your threads are writing to the same location, *a (that is, a[0]), so only one value survives. Assuming you want an array containing 1-64, each thread needs to write to its own element. You want something like this:
a[id] = id;
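2. Your arithmetic is wrong. Multiplying the indices gives 0 whenever blockIdx.x or threadIdx.x is 0, and the result is not unique per thread. If you want each of your 64 threads to get its own index, you can use this instead (it yields 0-63, so add 1 if you want 1-64):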
blockIdx.x*blockDim.x+threadIdx.x;
Putting everything together, the kernel body becomes:
int id= blockIdx.x*blockDim.x+threadIdx.x;
a[id] = id;
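For completeness, here is a minimal sketch of the whole program with both fixes applied. A few things in it are my additions rather than anything from your code. The host buffer is declared as an array of 64 ints: in your version int a holds a single int, so copying 64 ints back into &a overruns it, and printf shows only one value. The kernel stores id + 1 so the values are 1-64. Every runtime call is wrapped in a hypothetical CHECK macro. The calls to cudaGetDeviceCount, cudaGetLastError and cudaDeviceSynchronize are one way to find out whether a GPU is visible and whether the kernel actually ran, which should help with the question in your EDIT.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper (not in the original post): print the CUDA error
// string and abort if a runtime call does not return cudaSuccess.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

#define BLOCKS  8
#define THREADS 8
#define N       (BLOCKS * THREADS)

__global__ void start(int *a)
{
    // Unique global index per thread: 0..63 for 8 blocks of 8 threads.
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    a[id] = id + 1;  // store 1..64
}

int main(void)
{
    // Fails (and CHECK aborts) if the runtime cannot find a usable device.
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    printf("CUDA devices found: %d\n", count);

    int a[N];                       // host buffer large enough for all 64 ints
    int *d_a = NULL;
    size_t size = N * sizeof(int);

    CHECK(cudaMalloc((void **)&d_a, size));

    start<<<BLOCKS, THREADS>>>(d_a);
    CHECK(cudaGetLastError());       // errors from the launch itself
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK(cudaMemcpy(a, d_a, size, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_a));

    for (int i = 0; i < N; ++i)
        printf("%d\n", a[i]);
    return 0;
}
```

Compile it with nvcc, for example nvcc start.cu -o start (the file name is just a placeholder). The initial host-to-device copy is dropped because the kernel overwrites every element anyway. If no usable GPU is present or the launch fails, the CHECK macro should print the runtime's error string instead of letting the program print garbage.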