Puzzling bug in a very tiny OpenCL kernel when reading an image2d (using pyopencl)

While developing an OpenCL kernel that computes some features on an image, I came across a bug I couldn't solve. To isolate the problem I built a silly, tiny kernel that still returns wrong values. Here it is:

__constant sampler_t sampler =  CLK_NORMALIZED_COORDS_FALSE |
                                CLK_ADDRESS_CLAMP_TO_EDGE |
                                CLK_FILTER_NEAREST;

__kernel void readImageTest(__read_only image2d_t img, __global float *result){
    const int2 coord = (int2)(get_local_id(0), get_local_id(1));
    int2 nbOfWorkers = (int2)(get_local_size(0), get_local_size(1));
    uint4 tmp = read_imageui(img, sampler, coord);
    result[coord.x + coord.y * nbOfWorkers.x] = (float)tmp.x;
}

As you can see, this kernel is meant to run with a single workgroup, where each work-item copies the red channel of one pixel into a global buffer.
I call this kernel with 1 workgroup of size (2, 2) on an image of 6 by 6 pixels. Only the red channel contains values different from 0. These values go from 0 to 35 in row-major order: the top-left pixel has its red value set to 0, its right neighbor to 1 and so on, down to the bottom-right pixel with a red value of 35. Here are some fragments of the Python code:

def test_read_img(self):
    arr = np.array(range(0, 36), dtype=np.uint8).reshape((6, 6))
    img = np.dstack((arr, np.zeros((arr.shape[0], arr.shape[1], 3), dtype=np.uint8)))
    result = self.detector.read_img(img, (2, 2))

detector is an instance of a class that handles the OCL calls. Here is its *read_img* method:

def read_img(self, image, local_size):
    # Create a 2D image from the host array; the shape is passed as (width, height).
    cl_image = cl.Image(self.ctx,
                        self.mf.READ_ONLY | self.mf.COPY_HOST_PTR,
                        self.cl_img_format,
                        image.shape[1::-1],
                        None,
                        image)
    # One float32 per work-item of the single workgroup.
    out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY,
                        size=int(local_size[0] * local_size[1] * np.dtype('float32').itemsize))
    # Global size equals local size, so exactly one workgroup runs.
    self.prog.readImageTest(self.queue, local_size, local_size, cl_image, out_buf)
    result = np.zeros(local_size[0] * local_size[1], np.float32)
    cl.enqueue_copy(self.queue, result, out_buf).wait()
    return result

And finally, here is how the attribute *cl_img_format* is instantiated:

self.cl_img_format = cl.ImageFormat(cl.channel_order.RGBA,
                                    cl.channel_type.UNSIGNED_INT8)

So if everything worked fine, the result should be [0. 1. 6. 7.]; instead I get [0. 24. 4. 28.].
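For reference, the expected values can be reproduced without any OpenCL device. This is only a NumPy sketch of what the kernel is supposed to do; the helper name simulate_read_image_test is made up for illustration and is not part of pyopencl or of my detector class:

```python
import numpy as np

# Same test image as in test_read_img: red channel 0..35, other channels 0.
arr = np.arange(36, dtype=np.uint8).reshape((6, 6))
img = np.dstack((arr, np.zeros((6, 6, 3), dtype=np.uint8)))

def simulate_read_image_test(pixels, local_size):
    """Hypothetical CPU stand-in for readImageTest.

    pixels is indexed logically as (y, x, channel);
    local_size is (size_x, size_y), as passed to the kernel launch.
    """
    size_x, size_y = local_size
    result = np.zeros(size_x * size_y, dtype=np.float32)
    for y in range(size_y):
        for x in range(size_x):
            # Mirrors: result[coord.x + coord.y * nbOfWorkers.x] = (float)tmp.x
            result[x + y * size_x] = pixels[y, x, 0]
    return result

expected = simulate_read_image_test(img, (2, 2))
print(expected)  # [0. 1. 6. 7.]
```

Because this indexes the array logically, it gives the expected answer whatever the memory layout of img is.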
I tested this code on three different devices: 2 ATIs and 1 NVIDIA. All returned the same wrong result. I also wrote a small C program that does the same thing as the Python code and calls the same kernel, and that one returned the proper result. So my mistake is in the Python code, but I really can't see it, though it must be right under my nose. Does anybody have an idea what could be wrong?
Thanks
P.S. I'm using Win7 x64, free EPD 7.3-2 distribution, python 2.7.3 and I used the pyopencl installer from this website.

Solution

OK, I found what was wrong... stupid me. So, if someone is as absent-minded as me, here is the solution:
As Thomas suggested, I tested the “reading part” and it returned the expected result.
I also read the image back with cl.enqueue_read_image. The result was wrong, but it explained the values I got the first time. I had something like:

[[[ 0  6 12 18]
  [24 30  1  7]
  [13 19 25 31]
  [ 2  8 14 20]
  [26 32  3  9]
  [15 21 27 33]]

 [[ 4 10 16 22]
  [28 34  5 11]
  [17 23 29 35]
  [ 0  0  0  0]
  [ 0  0  0  0]
  [ 0  0  0  0]]
…]]]

The rest is only 0.
The reason is that the dstack function returned an F-ordered array, while the image was filled from the array's raw memory as if it were C-ordered (row-major). D’oh!!
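To see why an F-ordered array gives exactly those values, here is a NumPy-only sketch. It lays the pixel data out in F order explicitly (so it does not depend on what dstack returns in a given NumPy version) and then reinterprets the same bytes as C-ordered RGBA pixels, which is my assumption about what happens when the raw host memory is copied into the image:

```python
import numpy as np

arr = np.arange(36, dtype=np.uint8).reshape((6, 6))
img = np.dstack((arr, np.zeros((6, 6, 3), dtype=np.uint8)))  # logical shape (6, 6, 4)

# Elements in the order they sit in memory for an F-ordered array:
# the first index varies fastest, the channel index slowest.
f_bytes = img.ravel(order='F')

# A reader that assumes C order sees those same bytes as rows of RGBA pixels.
seen = f_bytes.reshape(img.shape)

print(seen[0, :2])  # first two pixels of row 0: [0 6 12 18] and [24 30 1 7]
print(seen[1, :2])  # first two pixels of row 1: [4 10 16 22] and [28 34 5 11]

# Red channel as read by the 2x2 workgroup, in the order the kernel writes it.
red = [int(seen[y, x, 0]) for y in range(2) for x in range(2)]
print(red)  # [0, 24, 4, 28]
```

The pixels match the read-back above, and the red values match the wrong kernel output.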
I fixed the problem by copying the result of the dstack function:

img = np.dstack((arr, np.zeros((arr.shape[0], arr.shape[1], 3), dtype=np.uint8))).copy()

The default order for the ndarray copy() method is 'C', so the copy is stored in row-major order, which is the layout the image upload expects.
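A slightly more defensive variant, as a sketch only: check the array's flags before handing it to OpenCL and convert with np.ascontiguousarray, which gives a C-ordered array and should leave an already C-contiguous one as it is. The F-ordered input is forced here with np.asfortranarray purely for the demonstration:

```python
import numpy as np

arr = np.arange(36, dtype=np.uint8).reshape((6, 6))
img = np.dstack((arr, np.zeros((6, 6, 3), dtype=np.uint8)))

# Force the problematic layout so the sketch does not depend on dstack's behavior.
f_img = np.asfortranarray(img)
print(f_img.flags['C_CONTIGUOUS'])  # False

# Same logical values, but guaranteed row-major memory.
c_img = np.ascontiguousarray(f_img)
print(c_img.flags['C_CONTIGUOUS'])  # True

# Memory order now matches what a C-order reader expects:
# RGBA of pixel (0, 0) followed by RGBA of pixel (1, 0).
print(c_img.ravel(order='K')[:8])   # [0 0 0 0 1 0 0 0]
```

In read_img this would amount to converting image once before the cl.Image call, instead of relying on every caller to remember the .copy().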