python c multidimensional-array opencl pyopencl

PyOpenCL 2D array kernel get_global_id(1) error

I'm a newbie with OpenCL. I took the example code from this website: http://www.drdobbs.com/open-source/easy-opencl-with-python/240162614?pgno=2 and customized it a little. My goal is to send a 4x4 matrix filled with ones to the kernel and read it back. I know it's trivial code, but I need it to understand how OpenCL works. The input matrix is this one:

 [[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]

However, the output I get from the kernel, which should be identical to the input, is this:

[[ 1.  1.  1.  1.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]

This is my full code:

import pyopencl as cl
from pyopencl import array
import numpy as np

## Step #1. Obtain an OpenCL platform.
platform = cl.get_platforms()[0]

## It would be necessary to add some code to check the support for
## the necessary platform extensions with platform.extensions

## Step #2. Obtain a device id for at least one device (accelerator).
device = platform.get_devices()[1]

## It would be necessary to add some code to check the support for
## the necessary device extensions with device.extensions

## Step #3. Create a context for the selected device.
context = cl.Context([device])

## Step #4. Create the accelerator program from source code.
## Step #5. Build the program.
## Step #6. Create one or more kernels from the program functions.
program = cl.Program(context, """
    __kernel void matrix_dot_vector(const unsigned int size, __global const float *matrix, __global float *result)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        result[x + size * y] = matrix[x + size * y];
    }
    """).build()

matrix = np.ones((4,4), np.float32)

## Step #7. Create a command queue for the target device.
queue = cl.CommandQueue(context)

## Step #8. Allocate device memory and move input data from the host to the device memory.
mem_flags = cl.mem_flags
matrix_buf = cl.Buffer(context, mem_flags.READ_ONLY | mem_flags.COPY_HOST_PTR, hostbuf=matrix)
destination_buf = cl.Buffer(context, mem_flags.WRITE_ONLY, matrix.nbytes)

## Step #9. Associate the arguments to the kernel with kernel object.
## Step #10. Deploy the kernel for device execution.
program.matrix_dot_vector(queue, matrix.shape, None, np.int32(matrix.size), matrix_buf, destination_buf)

## Step #11. Move the kernel's output data to host memory.
matrix_dot_vector = np.ones((4,4), np.float32)
cl.enqueue_copy(queue, matrix_dot_vector, destination_buf)

## Step #12. Release context, program, kernels and memory.
## PyOpenCL performs this step for you, and therefore,
## you don't need to worry about cleanup code

print(matrix_dot_vector)

As far as I can tell, int y = get_global_id(1); always evaluates to 0, and that is what causes the error. I don't understand why, since I'm passing the correct global size to the kernel: the second argument of program.matrix_dot_vector(queue, matrix.shape, None, np.int32(matrix.size), matrix_buf, destination_buf) is matrix.shape, which equals (4, 4).

Can anyone see what's wrong?

Thanks!

Solution

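The wrong value is being passed for the first kernel parameter. The kernel uses size as the row stride in x + size * y, so it should be the length of one dimension, not the total number of elements. With np.int32(matrix.size), size is 16, and for every y >= 1 the index x + 16 * y lies past the end of the 16-element buffers, so only the y = 0 row is copied. get_global_id(1) is not always 0; the accesses for the other rows simply land out of bounds. Change np.int32(matrix.size) into np.int32(matrix.shape[0]).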
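
To see the effect of the stride without a device, here is a minimal host-side sketch in plain NumPy that mimics the kernel's index computation. The helper name emulate_copy_kernel is hypothetical and not part of PyOpenCL. The sketch also assumes that untouched output elements read back as zeros and that out-of-bounds accesses are silently skipped; on a real device the untouched part of the buffer is uninitialized and out-of-bounds accesses are undefined behavior.

```python
import numpy as np


def emulate_copy_kernel(matrix, size):
    """Mimic result[x + size * y] = matrix[x + size * y] on the host.

    Hypothetical helper for illustration only; it is not a PyOpenCL API.
    """
    flat_in = matrix.ravel()
    flat_out = np.zeros_like(flat_in)  # assumption: untouched output reads as 0
    skipped = 0
    for y in range(matrix.shape[1]):      # plays the role of get_global_id(1)
        for x in range(matrix.shape[0]):  # plays the role of get_global_id(0)
            index = x + size * y
            if index < flat_in.size:
                flat_out[index] = flat_in[index]
            else:
                skipped += 1  # on a device this would be an out-of-bounds access
    return flat_out.reshape(matrix.shape), skipped


matrix = np.ones((4, 4), np.float32)

# size = matrix.size (16): only the first row is ever in bounds.
wrong, wrong_skipped = emulate_copy_kernel(matrix, matrix.size)

# size = matrix.shape[0] (4): every element is covered exactly once.
right, right_skipped = emulate_copy_kernel(matrix, matrix.shape[0])

print(wrong)
print("work-items out of bounds:", wrong_skipped)
print(right)
```

With the stride of 16, 12 of the 16 work-items compute an index outside the buffer, which matches the single copied row in the question. With the stride of 4, the whole matrix is copied.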