How to subtract an array from each row in matrix in PyOpenCL?

I want to subtract the same array of dimensions 1xN to every row of a matrix of dimensions MxN. Namely, perform the same operation numpy does when doing a-b, being the former the matrix, and the latter the array. For example:

import pyopencl as cl
import pyopencl.array as cl_array
import numpy as np

a = np.matrix('1 2 3; 4 5 6').astype(np.float32)
b = np.array([1,2,3]).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
a_dev = cl_array.to_device(queue, a)
b_dev = cl_array.to_device(queue, b)
dest_dev = cl_array.empty_like(a_dev)

prg = cl.Program(ctx, """
    __kernel void fun(
                    const unsigned int size,
                    __global const float *a,
                    __global const float *b,
                    __global float *c)
    {
      int i = get_global_id(0);
      int j = get_global_id(1);
      c[i + j*size] = a[i + j*size] - b[j];
    }
    """).build()

prg.fun(queue, a.shape, None, np.int32(len(a)), a_dev.data, b_dev.data, dest_dev.data)
print(dest_dev)

I thought that this kernel would return the matrix

[0 0 0 ; 3 3 3]

but in fact it returns

[0 1 1 ; 2 2 3]

Where is the mistake?

Solution

You have to swap the rows und columns of the matrixcomputation in the kernel. You start the kernel with the dimensions that

a.shape           # returns tuple(rows, columns)

returns. It returns (2, 3) instead of (3, 2). So change the line in the kernel, where you do the computation of the matrix to :

c[j + i*size] = a[j + i*size] - b[j];

In this case i are the number of rows, and j are the columns.

There is also a mistake when you give the size of the matrix to the kernel. Instead of

 np.int32(len(a))

whitch returns 2 the number of rows (y-axis). Change the line with the kernel launch to :

prg.fun(queue, a.shape, None, np.int32(a.shape[1]), a_dev.data, b_dev.data, dest_dev.data)

a.shape[1] gives you the size of the second dimension of the matrix, which are the number of columns, in this example 3.

There is also the possibility to query the number of columns in the kernel itself with a Work-Item Built-In function:

 unsigned int size = get_global_size(1);   // size of the 2. dim of the kernel

In this case you don't have to pass the number of columns as a kernel argument.