PyOpenCL program does not return expected output

I'm just starting to learn OpenCL through PyOpenCL, and I've been following a couple of tutorials. I'm working from the script here. The program runs without any errors, but the sum of the arrays is not correct. Here is the exact code:

# Use OpenCL To Add Two Random Arrays (This Way Shows Details)

import pyopencl as cl  # Import the OpenCL GPU computing API
import numpy as np  # Import Np number tools

platform = cl.get_platforms()[0]  # Select the first platform [0]
for device in platform.get_devices():
     print device
device = platform.get_devices()[2]  # Select the third device on this platform [2]
context = cl.Context([device])  # Create a context with your device
queue = cl.CommandQueue(context)  # Create a command queue with your context

np_a = np.random.rand(5).astype(np.float32)  # Create a random np array
np_b = np.random.rand(5).astype(np.float32)  # Create a random np array
np_c = np.empty_like(np_a)  # Create an empty destination array

cl_a = cl.Buffer(context, cl.mem_flags.COPY_HOST_PTR, hostbuf=np_a)
cl_b = cl.Buffer(context, cl.mem_flags.COPY_HOST_PTR, hostbuf=np_b)
cl_c = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, np_c.nbytes)
# Create three buffers (plans for areas of memory on the device)

kernel = \
"""
__kernel void sum(__global float* a, __global float* b, __global float* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
"""  # Create a kernel (a string containing C-like OpenCL device code)

program = cl.Program(context, kernel).build()
# Compile the kernel code into an executable OpenCL program

program.sum(queue, np_a.shape, None, cl_a, cl_b, cl_c)
# Enqueue the program for execution, causing data to be copied to the device
#  - queue: the command queue the program will be sent to
#  - np_a.shape: a tuple of the arrays' dimensions
#  - cl_a, cl_b, cl_c: the memory spaces this program deals with
queue.finish()
np_arrays = [np_a, np_b, np_c]
cl_arrays = [cl_a, cl_b, cl_c]

for x in range(3):
    cl.enqueue_copy(queue, cl_arrays[x], np_arrays[x])

# Copy the data for array c back to the host

arrd = {"a":np_a, "b":np_b, "c":np_c}

for k in arrd:
    print k + ": ", arrd[k]


# Print all three host arrays, to show sum() worked

And the output:

<pyopencl.Device 'Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz' on 'Apple' at 0xffffffff>
<pyopencl.Device 'Iris Pro' on 'Apple' at 0x1024500>
<pyopencl.Device 'AMD Radeon R9 M370X Compute Engine' on 'Apple' at 0x1021c00>
a:  [ 0.44930401  0.77514887  0.28574091  0.24021916  0.3193087 ]
c:  [ 0.0583559   0.85157514  0.80443901  0.09400933  0.87276274]
b:  [ 0.81869799  0.49566364  0.85423696  0.68896079  0.95608395]

My guess is that the data is being copied properly between the host and the device, but the kernel isn't being executed. As far as I understand from this and other tutorials, this code should be sufficient to execute the kernel. Is some other call required to launch it? I'm not sure exactly which version of PyOpenCL this example targets, but I'm running 2016.2 from conda-forge on Mac OS X. Any help is much appreciated.

Solution

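You have called enqueue_copy with its arguments in the wrong order. The call is enqueue_copy(queue, dest, src), with the destination first. As written, your loop copies each host array into its device buffer, so the kernel's result in cl_c is overwritten with the uninitialized contents of np_c (np.empty_like does not zero the array), and np_c on the host is never written. The kernel did run; its result just never reached the host. To read the buffers back, put the host array first: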

cl.enqueue_copy(queue, np_arrays[x], cl_arrays[x])

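Also, you don't need to copy the input arrays back at all: np_a and np_b were created on the host and the kernel does not modify them. Copying only the result is enough, i.e. cl.enqueue_copy(queue, np_c, cl_c).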
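For reference, here is a minimal sketch of the same addition with only the result copied back, destination first. It is not the tutorial's script. The function name add_on_device is made up for this example. It also differs from your code in a few assumed details: cl.create_some_context(interactive=False) picks whatever device is available instead of your device [2], the input pointers are marked const to match READ_ONLY buffers, and pyopencl is imported inside the function only so the file can be loaded on a machine without an OpenCL runtime.

```python
import numpy as np

# Same kernel as in the question; the inputs are marked const because
# their buffers are created READ_ONLY below.
KERNEL_SRC = """
__kernel void sum(__global const float* a,
                  __global const float* b,
                  __global float* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
"""


def add_on_device(np_a, np_b):
    """Add two float32 arrays on an OpenCL device and return the sum.

    Hypothetical helper for illustration only.
    """
    # Imported here (an assumption of this sketch) so that loading the
    # file does not require PyOpenCL to be installed.
    import pyopencl as cl

    # Assumption: take any available device rather than a fixed index.
    context = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(context)

    mf = cl.mem_flags
    # COPY_HOST_PTR uploads the inputs when the buffers are created.
    cl_a = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=np_a)
    cl_b = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=np_b)

    np_c = np.empty_like(np_a)  # uninitialized until the copy below
    cl_c = cl.Buffer(context, mf.WRITE_ONLY, np_c.nbytes)

    program = cl.Program(context, KERNEL_SRC).build()
    program.sum(queue, np_a.shape, None, cl_a, cl_b, cl_c)

    # Destination first, source second: device buffer -> host array.
    # Only the result needs to come back; the inputs never changed.
    cl.enqueue_copy(queue, np_c, cl_c)
    return np_c
```

Calling add_on_device(np_a, np_b) with two float32 arrays of equal shape should return an array that matches np_a + np_b, which is an easy way to confirm that the kernel ran. A copy from a device buffer into a host array blocks by default, so no separate queue.finish() is needed before reading np_c.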