
PyOpenCL - Wrong arguments passed to kernel


I'm having a hard time doing a basic scalar operation with PyOpenCL. Basically, what I'm trying to do is: given a float array, multiply each element by a scalar float and put the result in a new buffer. This should be easy, but for some reason it's not working as it should.

This is the code I'm using (variables ending in _h are host variables; variables ending in _g are device variables):

import numpy as np
import pyopencl as cl
# Device Init
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
MF = cl.mem_flags

# Host Initial Variables
a_h = np.array([1.0, 2.0, 3.0, 4.0])
b_h = np.float32(2.0)

# DEVICE Variable Allocation
a_g = cl.Buffer(ctx, MF.READ_ONLY | MF.COPY_HOST_PTR, hostbuf=a_h)
c_g = cl.Buffer(ctx, MF.WRITE_ONLY, a_h.nbytes)

# DEVICE's Kernel - Multiply each element of the array a_g by the scalar b_g and put the result on the array c_g
source = """
__kernel void mult(float b_g, __global float *a_g, __global float *c_g){
    const int gid = get_global_id(0);
    c_g[gid] = b_g * a_g[gid];
}
"""

prg = cl.Program(ctx, source).build()
prg.mult(queue, a_h.shape, None, b_h, a_g, c_g)

# Export The Result On The DEVICE Back To The HOST
c_h = np.empty_like(a_h)
cl.enqueue_copy(queue, c_h, c_g)

# Output
print(c_h)

The expected output was:

[2.0 4.0 6.0 8.0]
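The expected values can be checked on the host with NumPy alone. A minimal sketch of a CPU reference for the same element-wise scalar multiply; the name c_ref is illustrative, and a device result could be compared against it with np.allclose:

```python
import numpy as np

# Same host inputs as in the question.
a_h = np.array([1.0, 2.0, 3.0, 4.0])
b_h = np.float32(2.0)

# Element-wise scalar multiply on the CPU: the values the kernel should produce.
c_ref = b_h * a_h
print(c_ref)  # [2. 4. 6. 8.]
```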

This was the output:

[  2.56000000e+002   5.12000000e+002  -1.73777009e+308  -1.73777009e+308]

I don't understand why. I've tried reading the PyOpenCL project page, but to be honest I didn't understand much of it. I guess I'm getting either the kernel definition or the kernel call wrong.

I've also tried declaring the kernel like this:

__kernel void mult(__global float *b_g, __global float *a_g, __global float *c_g)

But, as expected, it didn't work, because I didn't create a pointer for b_g, and I don't know how to create one. The error was:

:2:39: error: parameter may not be qualified with an address space
__kernel void mult(__global float b_g, __global float *a_g, __global float *c_g){
                                  ^

My idea behind this is simple: since the value b_g is common to all the workers, I want to put it in global memory once so that every worker can access it, instead of repeating the value for every worker.

I believe this should be really simple but I'm new to parallel computing and have no idea how to fix this.

Thank you.


Solution

  • In NumPy, all elements of an array share a common data type, which is stored in the array's dtype attribute. (numpy.array doc, numpy datatypes)

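    The data in your array a_h has data type float64 (64-bit float). Your kernel expects 32-bit floats (float32), so it interprets the bytes of a_h as float32 values: it sees 8 32-bit floats instead of 4 64-bit floats, and your 4 work-items only process the first 4 of them. The result is then copied back into c_h = np.empty_like(a_h), which is float64 again, so the bytes are reinterpreted a second time. That is where values such as 256 and 512 come from.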

    To make it work, store the values in a_h as float32 by defining it like this:

    a_h = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
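The double reinterpretation described above can be reproduced on the host with NumPy's view(), which reads an array's bytes as another dtype without copying. A minimal sketch, assuming a little-endian host and device (e.g. x86); the c_g array here only simulates the device buffer and is not a PyOpenCL object:

```python
import numpy as np

# Host data exactly as in the question: default dtype float64, 32 bytes total.
# Assumes little-endian byte order on host and device.
a_h = np.array([1.0, 2.0, 3.0, 4.0])
b_h = np.float32(2.0)
n = a_h.shape[0]                        # global size passed to prg.mult: 4 work-items

# `__global float *a_g` sees the same 32 bytes as 8 float32 values;
# work-items 0..3 only read the first 4 of them.
a_seen = a_h.view(np.float32)[:n]       # [0.0, 1.875, 0.0, 2.0]

# c_g was allocated with a_h.nbytes (32 bytes), but the kernel writes only 16.
# Simulated here as zeros; on the real device the rest was uninitialized memory.
c_g = np.zeros(a_h.nbytes // 4, dtype=np.float32)
c_g[:n] = b_h * a_seen                  # c_g[gid] = b_g * a_g[gid]

# Copying into c_h = np.empty_like(a_h) reads the bytes back as float64.
c_h = c_g.view(np.float64)
print(c_h)                              # first two values: 256.0 and 512.0

# With dtype=np.float32 on the host, host and kernel agree on the layout.
a_fixed = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
c_fixed = b_h * a_fixed
print(c_fixed)                          # [2. 4. 6. 8.]
```

The last two values in the original output depend on whatever was left in device memory, which is why the sketch zero-fills that part of the buffer instead of trying to reproduce it.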