I've been working on a new problem with PyOpenCL in which I have to deal with complex numbers. More precisely, it would be really handy to use a 2D numpy array of complex numbers, something like: np_array[np_array[C_number, C_number, ..], np_array[C_number, C_number, ..], ...]
Then, for the results, I would need a simple 1D numpy array of complex numbers.
I've also noticed that PyOpenCL sees a numpy complex number as a float2, so I use a float16 for my data array since I have around 8 numbers to deal with.
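As a quick sanity check on that mapping, viewing a complex64 array as float32 shows the interleaved real/imaginary pairs that the kernel's float2 would see (this snippet is just an illustration, not part of the program below):
import numpy as np
# complex64 stores two float32 values (real, imag) back to back,
# which is why PyOpenCL can hand it to a kernel as an OpenCL float2.
a = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)
print(a.view(np.float32))  # [ 1.  2.  3.  4.]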
To work out the basic operations I've built a simple program. I've managed to build the initial arrays and send them to the kernel, but the results are different from what I expected. I guess it has something to do with the thread's ID, but I can't seem to figure it out.
The code I'm using is the following:
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
MF = cl.mem_flags
M = 3
zero = np.complex64(0.0)
X1_h = np.array([1 + 1j*2, 2 + 1j*3, 3 + 1j*4]).astype(np.complex64)
X2_h = np.array([1 + 1j*2, 2 + 1j*3, 3 + 1j*4]).astype(np.complex64)
X3_h = np.array([1 + 1j*2, 2 + 1j*3, 3 + 1j*4]).astype(np.complex64)
Y1_h = np.array([4 + 1j*5, 5 + 1j*6, 6 + 1j*7]).astype(np.complex64)
Y2_h = np.array([4 + 1j*5, 5 + 1j*6, 6 + 1j*7]).astype(np.complex64)
Y3_h = np.array([4 + 1j*5, 5 + 1j*6, 6 + 1j*7]).astype(np.complex64)
aux_h = np.complex64(1 + 1j*1)
RES_h = np.empty_like(X1_h)
dados_h = []
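# each row holds 6 complex64 values: [X1[i], X2[i], X3[i], Y1[i], Y2[i], Y3[i]]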
for i in range(3):
    dados_h.append(np.array([X1_h[i], X2_h[i], X3_h[i], Y1_h[i], Y2_h[i], Y3_h[i]]).astype(np.complex64))
dados_h = np.array(dados_h).astype(np.complex64)
print(dados_h)
aux_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf=aux_h)
dados_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf=dados_h)
RES_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf = RES_h)
Source = """
__kernel void soma(__global float2 *aux, __global float16 *dados, __global float2 *res){
    const int gid_x = get_global_id(0);
    const int gid_y = get_global_id(1);
    res[gid_x].x = dados[gid_y].s0;
    res[gid_x].y = dados[gid_y].s1;
}
"""
prg = cl.Program(ctx, Source).build()
completeEvent = prg.soma(queue, (M,), None, aux_d, dados_d, RES_d)
completeEvent.wait()
cl.enqueue_copy(queue, RES_h, RES_d)
print "GPU"
print RES_h
The output I'm getting is the following:
[[ 1.+2.j 1.+2.j 1.+2.j 4.+5.j 4.+5.j 4.+5.j]
[ 2.+3.j 2.+3.j 2.+3.j 5.+6.j 5.+6.j 5.+6.j]
[ 3.+4.j 3.+4.j 3.+4.j 6.+7.j 6.+7.j 6.+7.j]]
GPU
[ 1.+2.j 1.+2.j 1.+2.j]
My expected output is:
[ 1.+2.j 2.+3.j 3.+4.j]
I can't understand how I'm getting that result. As said, I believe it to be something related to the thread IDs, but I can't figure it out. If I use gid_x instead of gid_y on the right-hand side of the res[gid_x] = ... assignment, I get the following:
[ 1.+2.j 2.+3.j 6.+7.j]
Can anyone give me some insight into what I'm doing wrong, please?
You're launching a 1D kernel, so get_global_id(1) will always return 0. This explains why your kernel simply copies the first element of the dados array into each element of the output.
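For reference, get_global_id(1) only takes non-zero values when the kernel is launched over a 2D range; a hypothetical 2D launch would look something like this (N being a second dimension your current code doesn't have):
completeEvent = prg.soma(queue, (M, N), None, aux_d, dados_d, RES_d)  # hypothetical 2D launch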
Using a float16 to represent one 'row' of your input only works if you actually have 8 complex numbers per row. In your example you only have 6, which is why you don't quite get the correct results when copying from dados[gid_x].
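If you did want to keep the float16 view, one option (just a sketch) is to pad each row out to 8 complex numbers on the host side, so that every row occupies exactly one float16:
# Sketch: pad each row of 6 complex64 values with zeros up to 8,
# so one row == 16 float32 values == one OpenCL float16.
padded_h = np.zeros((M, 8), dtype=np.complex64)
padded_h[:, :6] = dados_h
dados_d = cl.Buffer(ctx, MF.READ_WRITE | MF.COPY_HOST_PTR, hostbuf=padded_h)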
To allow your code to deal with an arbitrary row size, simply pass the width in as a parameter, and then compute the linear index manually:
__kernel void soma(__global float2 *aux,
                   __global float2 *dados,
                   __global float2 *res,
                   int rowWidth){
    const int gid_x = get_global_id(0);
    res[gid_x] = dados[gid_x*rowWidth];
}
and then pass the row-width as an extra argument when you launch the kernel:
# Pass your actual row-width instead of 6!
completeEvent = prg.soma(queue, (M,), None, aux_d, dados_d, RES_d, np.int32(6))
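Reading the result back as before should then give the output you expected:
completeEvent.wait()
cl.enqueue_copy(queue, RES_h, RES_d)
print(RES_h)  # expected: [ 1.+2.j  2.+3.j  3.+4.j]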