
PyOpenCL reduction algorithm error


Recently I've been trying to learn GPU programming with PyOpenCL, but despite my best efforts I haven't been able to get the reduction algorithm shown in the code below to run. Instead, the code raises

RuntimeError: clEnqueueReadBuffer failed: OUT_OF_RESOURCES

My understanding of this error is that it indicates either insufficient memory allocation or out-of-bounds indexing in the kernel. For small global sizes (that is, small (N, A, t)) the code runs successfully, so I suspect the former. However, I allocate np.dtype(np.float32).itemsize*t bytes of local memory for a work-group size of (1,1,t), which I believe should be sufficient. Does anyone know why I'm getting this error? I'm running the kernel on an NVIDIA GeForce GTX 960, if that helps.

import numpy as np
import pyopencl as cl

np.random.seed(5)

N=2500*56
A=6
t=64

plat = cl.get_platforms()
devices = plat[0].get_devices()
ctx = cl.Context([devices[0]])
queue = cl.CommandQueue(ctx)

actions=np.random.randint(0,2,(N,A,t)).flatten(order='F')
tau=np.arange(1,np.add(t,1))
d=np.random.rand(N).astype(np.float32)
baseAct=np.empty((N,A)).astype(np.float32).flatten(order='F')

mf = cl.mem_flags
actions_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                        hostbuf=actions)
tau_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=tau)
d_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=d)
loc_buf = cl.LocalMemory(np.dtype(np.float32).itemsize*t)
baseAct_buf = cl.Buffer(ctx, mf.WRITE_ONLY, baseAct.nbytes)

prg = cl.Program(ctx, """
    __kernel void calc_baseAct(__global const int *actions,
    __global const int *tau,
    __global const float *d,
    __local float *loc,
    __global float *baseAct,
    int N,
    int A,
    int t)
    {
      int xg = get_global_id(0);
      int yg = get_global_id(1);
      int zg = get_global_id(2);
      int xl = get_local_id(0);
      int yl = get_local_id(1);
      int zl = get_local_id(2);
      int xw = get_group_id(0);
      int yw = get_group_id(1);
      int zw = get_group_id(2);

      loc[xl+N*yl+N*A*zl] = actions[xg+N*yg+N*A*zg]*pow(tau[zg],-d[xg]);
      barrier(CLK_LOCAL_MEM_FENCE);


      for(uint s = t/2; s > 0; s >>= 1) {
        if(zl < s) {
          loc[xl+N*yl+N*A*zl] += loc[xl+N*yl+N*A*(zl+s)];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
      }    
      if(zl == 0) baseAct[xw+N*yw+N*A*zw] = loc[xl+N*yl+N*A*zl];

    }
    """).build()

prg.calc_baseAct(queue, (N,A,t), (1,1,t), actions_buf, tau_buf, d_buf,
                 loc_buf, baseAct_buf, np.int32(N), np.int32(A), np.int32(t))
cl.enqueue_copy(queue, baseAct, baseAct_buf)

baseAct=baseAct.reshape((N,A), order='F')
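For reference, the quantity the kernel is meant to compute is baseAct[n, a] = sum over z of actions[n, a, z] * tau[z]**(-d[n]). A host-side sketch of that reduction in plain NumPy (with small, hypothetical sizes so it runs quickly on the CPU) can be used to check the GPU result:

```python
import numpy as np

np.random.seed(5)
N, A, t = 4, 3, 8  # small sizes for a quick host-side check

actions = np.random.randint(0, 2, (N, A, t)).astype(np.float32)
tau = np.arange(1, t + 1, dtype=np.float32)
d = np.random.rand(N).astype(np.float32)

# Reduce over the last axis: baseAct[n, a] = sum_z actions[n, a, z] * tau[z]**(-d[n])
baseAct = (actions * tau[None, None, :] ** (-d[:, None, None])).sum(axis=2)
print(baseAct.shape)  # (4, 3)
```

Comparing the kernel's output (reshaped to (N, A)) against this reference makes out-of-bounds bugs much easier to spot, since garbage values stand out immediately.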

Solution

  • The access to loc is clearly out of bounds: each work-group's loc buffer holds only t = 64 floats, yet it is indexed with xl+N*yl+N*A*zl, where zl ranges over [0, 63] and is multiplied by N*A with N = 2500*56 and A = 6. Because the work-group size is (1,1,t), xl and yl are always 0, so the local index should simply be zl (and zl+s inside the reduction loop).
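A minimal sketch of the corrected kernel, assuming the same (1,1,t) work-group layout as the question (only the local-memory indexing and the output index change; all identifiers match the original kernel):

```c
__kernel void calc_baseAct(__global const int *actions,
                           __global const int *tau,
                           __global const float *d,
                           __local float *loc,
                           __global float *baseAct,
                           int N, int A, int t)
{
    int xg = get_global_id(0);
    int yg = get_global_id(1);
    int zg = get_global_id(2);
    int zl = get_local_id(2);   /* xl and yl are always 0 for a (1,1,t) group */
    int xw = get_group_id(0);
    int yw = get_group_id(1);

    /* loc holds t floats per work-group, so index it with zl alone */
    loc[zl] = actions[xg + N*yg + N*A*zg] * pow((float)tau[zg], -d[xg]);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* tree reduction over the local z dimension */
    for (uint s = t/2; s > 0; s >>= 1) {
        if (zl < s)
            loc[zl] += loc[zl + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* one result per work-group; with one group along z, zw is 0,
       so the output index reduces to xw + N*yw */
    if (zl == 0)
        baseAct[xw + N*yw] = loc[0];
}
```

One host-side caveat worth checking as well: np.random.randint and np.arange may produce 64-bit integers by default (this is platform-dependent), while the kernel declares actions and tau as int, so casting both to np.int32 before creating the buffers keeps the host and device layouts consistent.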