I'm trying to use pycuda to accelerate my neural net (I know TensorFlow is easier to use for GPU acceleration; I just wanted to do it manually first, as I am relatively new to neural networks). However, whenever I pass an array to the GPU and have each thread print the value of the array at its threadIdx, it prints zeros, even though I set the array values.
I have tried an extremely simple kernel for testing that just prints the values of a one-dimensional array, and I have tried changing the data type to float32.
The basic kernel that I'm using for testing of this issue:
test_mod = SourceModule("""
__global__ void test(float *a)
{
printf("%d: %d\\n", threadIdx.x, a[threadIdx.x]);
}
""")
The python code I'm using to create the array and initialize the kernel:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

a = np.asarray([4, 2, 1])
a = a.astype(np.float32)
test_module = test_mod.get_function("test")
test_module(cuda.In(a), block=(3, 1, 1))
I expect it to print 4, 2, and 1 in some order, but each thread prints 0.
The problem lies in the print statement within the kernel. The %d format specifier is intended for integers; it will not correctly format a floating point value. To fix it, modify the kernel like this:
test_mod = SourceModule("""
__global__ void test(float *a)
{
printf("%d: %f\\n", threadIdx.x, a[threadIdx.x]);
}
""")
[Answer assembled from comments and added as a community wiki entry to try and get the question off the unanswered queue for the CUDA tag.]