I'm trying to use pycuda to accelerate my neural net (I know TensorFlow is easier to use for GPU acceleration; I just wanted to do it manually first, as I am relatively new to neural networks). However, whenever I pass an array to the GPU and have each thread print the value of the array at its threadIdx, it prints zeros, even though I set the array values.
I have tried an extremely simple kernel for testing that just prints the values of a one-dimensional array, and I have tried changing the data type to float32.
The basic kernel that I'm using for testing of this issue:
test_mod = SourceModule("""
__global__ void test(float *a)
{
printf("%d: %d\\n", threadIdx.x, a[threadIdx.x]);
}
""")
The python code I'm using to create the array and initialize the kernel:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

a = np.asarray([4, 2, 1])
a = a.astype(np.float32)
test_module = test_mod.get_function("test")
test_module(cuda.In(a), block=(3, 1, 1))
I expect it to print 4, 2, and 1 in some order, but each thread prints 0.
The problem lies in the print statement within the kernel. The %d format specifier is intended for integers; it will not correctly format a floating point value. To fix it, modify the kernel like this:
test_mod = SourceModule("""
__global__ void test(float *a)
{
printf("%d: %f\\n", threadIdx.x, a[threadIdx.x]);
}
""")
[Answer assembled from comments and added as a community wiki entry to try and get the question off the unanswered queue for the CUDA tag.]