OpenCL nested loop produces unexpected results

I'm a bit new to OpenCL/C99, but can't understand why the the two kernels below give different results. X has been initialized to zeros, but requires "re-zeroing" at each outer loop step else incorrect results are obtained (see plot). Notice that I have not invoked any parallelism here; to the best of my knowledge, this is entirely serial:

content of kernels.cl:

#pragma OPENCL EXTENSION cl_khr_fp64: enable
#define __CL_ENABLE_EXCEPTIONS

__kernel void nested_sum_succeeds(
    int L,
    __global __read_only double* X,
    __global double* Y
)
{
    for (int k=0; k<=L; k++) {
        Y[k] = 0;
        for (int j=0; j<=k; j++) {
            Y[k] += X[j];
        }
    }
}

__kernel void nested_sum_fails(
    int L,
    __global __read_only double* X,
    __global double* Y
)
{
    for (int k=0; k<=L; k++) {
        // Y[k] = 0;
        for (int j=0; j<=k; j++) {
            Y[k] += X[j];
        }
    }
}

content of script.py:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
import matplotlib.pyplot as plt

with open("./kernels.cl") as fp:
    prog_str = fp.read()
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, prog_str).build()

L = 1000
X = np.linspace(0, 10, L)
X_dev = cl_array.to_device(queue, X)
Y_succeeds_dev = cl_array.to_device(queue, np.zeros(shape=X.shape, dtype=np.float64))
Y_fails_dev = cl_array.to_device(queue, np.zeros(shape=X.shape, dtype=np.float64))


nested_sum_succeeds = prog.nested_sum_succeeds
nested_sum_succeeds.set_scalar_arg_dtypes([
    np.int64,
    None,
    None,
])

nested_sum_succeeds(
    queue,
    (len(X),),
    None,
    L,
    X_dev.data,
    Y_succeeds_dev.data,
)


nested_sum_fails = prog.nested_sum_fails
nested_sum_fails.set_scalar_arg_dtypes([
    np.int64,
    None,
    None,
])

nested_sum_fails(
    queue,
    (len(X),),
    None,
    L,
    X_dev.data,
    Y_fails_dev.data,
)

np.allclose(Y_succeeds_dev.get(), Y_fails_dev.get()) #False

plt.ion()
plt.plot(Y_succeeds_dev.get())
plt.plot(Y_fails_dev.get())

Results:

Solution

Notice that I have not invoked any parallelism here;

Yes, the kernel does run in parallel - it's scheduled to run len(x) work items in the first dimension. Once you change it to process using 1 work item everything is alright:

nested_sum_succeeds(
    queue,
    (1,),
    None,
    L,
    X_dev.data,
    Y_succeeds_dev.data,
)

nested_sum_fails(
    queue,
    (1,),
    None,
    L,
    X_dev.data,
    Y_fails_dev.data,
)

Then np.allclose(Y_succeeds_dev.get(), Y_fails_dev.get()) returns True. You can also remove that zeroing Y[k] = 0; from the nested_sum_succeeds kernel as it's not needed.

Also if you want to run this kernels in other devices then you need some minor fixes because not all compilers are going to accept that the type of the first kernel argument is in the kernel int and in python is specified as np.int64, that has to match what is in the kernel, so:

nested_sum_succeeds.set_scalar_arg_dtypes([
    np.int32,
    None,
    None,
])

nested_sum_fails.set_scalar_arg_dtypes([
    np.int32,
    None,
    None,
])

And one more thing that applies to using on other devices, I would remove __read_only access qualifier which won't also compile on the all devices.