Tags: opencl, gpgpu, sycl

How do I Manage a Stateful Data Structure in Local Memory Shared by All Workitems in an OpenCL/SYCL Workgroup


I'm trying to optimize my memory-bound numerical simulation kernel in OpenCL/SYCL by using local memory to share data between workitems, so that I can reduce redundant global memory traffic.

When there is little to no data dependency within local memory, populating it is simple: one subdivides the local memory's index space and assigns a portion to each workitem with a suitable index calculation, so all workitems collectively load the data from global memory into the appropriate locations in local memory. A single barrier then separates each workitem's loading phase from its compute phase. Communication only occurs in the compute phase, and only implicitly through the shared memory - during the loading phase, no synchronization, communication, or even control logic is needed.
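
A minimal sketch of that simple pattern in OpenCL C, assuming a 1D kernel launched with a work-group size equal to TILE (the buffer names and the trivial compute step are placeholders):

    #define TILE 256

    __kernel void simple_tile(__global const float *in, __global float *out)
    {
        __local float tile[TILE];

        const int lid = get_local_id(0);
        const int gid = get_global_id(0);

        /* Loading phase: the local index space is partitioned 1:1 over the
           work-items, so no coordination beyond the index math is needed. */
        tile[lid] = in[gid];

        /* A single barrier separates the loading phase from the compute phase. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Compute phase: data sharing happens implicitly through 'tile'
           (a trivial neighbour access stands in for the real work). */
        out[gid] = tile[lid] + tile[(lid + 1) % TILE];
    }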

However, my kernel has a non-trivial data dependency chain (working out the dependencies is itself an exercise in combinatorics), because this is the only way to achieve a high degree of data reuse within the space constraints of local memory. The data to be loaded into local memory is subdivided into 2x2x2 blocks, and irregular data dependencies exist between these blocks. For example, to compute the next 3 blocks you need the previous 1 block; to compute the next 6 blocks, the previous 3 blocks; to compute the next 10 blocks, the previous 6 blocks; to compute the next 15 blocks, the previous 10 blocks; and so on. These blocks need to be "mapped" to workitems on the fly.

For considerations that are beyond the scope of this question, my plan right now is to use a 1024-workitem group size to reach an occupancy of 4 on AMD GCN while still being able to use 64 KiB of local memory. If each workitem is responsible for calculating a single point of the 2x2x2 blocks, then to fully saturate the GPU I must always have 128 blocks loaded in local memory at a time, so that the number of points being worked on is 2x2x2x128 = 1024.
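
A minimal sketch of that setup in OpenCL C; only the 1024-workitem group size and the 128-block pool follow from the figures above, and the float element type is an assumption:

    #define POINTS_PER_BLOCK 8    /* 2 x 2 x 2 */
    #define BLOCKS_IN_FLIGHT 128

    __attribute__((reqd_work_group_size(1024, 1, 1)))
    __kernel void simulate_sized(__global const float *in, __global float *out)
    {
        /* One point per work-item: 128 blocks x 8 points = 1024 points. */
        __local float block_pool[BLOCKS_IN_FLIGHT * POINTS_PER_BLOCK];

        /* ... loading and compute phases go here ... */
    }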

This requirement greatly complicates the logic of local memory management:

  1. The local memory must be used as a ring buffer: I need to keep loading new blocks into local memory as soon as I retire old blocks, which means manipulating several pointers and counters (see the sketch after this list).

  2. Ideally, 128 blocks must be loaded into local memory at a time, but this is further restricted by data dependency.

  3. When loading and retiring blocks, the data dependency between adjacent iterations must not be broken.
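
A rough sketch of the kind of runtime bookkeeping item 1 implies, in OpenCL C with placeholder names; the actual dependency and retirement tests are elided:

    #define LDS_BLOCKS       128
    #define POINTS_PER_BLOCK 8

    __kernel void ring_buffer_sketch(__global const float *in, __global float *out)
    {
        __local float block_pool[LDS_BLOCKS * POINTS_PER_BLOCK];
        __local int head;   /* oldest block still needed by a pending compute */
        __local int tail;   /* one past the newest block resident in the pool */

        const int lid = get_local_id(0);
        if (lid == 0) { head = 0; tail = 0; }
        barrier(CLK_LOCAL_MEM_FENCE);

        for (;;) {  /* per-iteration loop, termination condition elided */
            /* 1. Retire blocks whose consumers have finished: advance 'head'.
               2. Load new blocks while the dependencies allow it and
                  (tail - head) < LDS_BLOCKS: advance 'tail'.
               3. Barrier, compute, barrier, repeat - every pointer update
                  must be fenced so all work-items agree on the buffer state. */
            break;  /* placeholder so the sketch terminates */
        }
    }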

In this case, what is the best way to populate and manage local memory?

It's impossible to divide the blocks among the workitems in a regular and communication-free manner. A common solution I've seen in tutorials is using if/else so that only a single workitem is responsible for loading local memory or managing the state of the kernel. However, I feel that this greatly reduces the available global memory bandwidth, since only a single workitem (e.g. if (id == 0)) is making memory requests, which is insufficient to saturate the memory controller. It also introduces workitem divergence, which can be harmful to performance. A related solution is dividing the work at a wavefront boundary (e.g. if (id < 64)), which should perform better.
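
For reference, a minimal sketch of the two patterns in OpenCL C (the buffer name and size are placeholders):

    #define N 256

    __kernel void load_patterns(__global const float *in)
    {
        __local float staging[N];
        const int lid = get_local_id(0);

        /* Pattern 1: a single work-item does all loads and state updates.
           Only one memory stream is in flight and the branch diverges. */
        if (lid == 0) {
            for (int i = 0; i < N; ++i)
                staging[i] = in[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Pattern 2: restrict the work to the first wavefront (64 lanes on
           GCN): the branch is uniform per wavefront and 64 lanes issue loads. */
        if (lid < 64) {
            for (int i = lid; i < N; i += 64)
                staging[i] = in[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* ... compute phase using 'staging' ... */
    }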

In summary, my question is: what are the general strategies for managing workgroup-wide state or a stateful data structure shared by all workitems in a workgroup?


Solution

  • In summary, my question is: what are the general strategies for managing workgroup-wide state or a stateful data structure shared by all workitems in a workgroup?

    I've solved the problem by pre-determining the number of blocks to be read.

    First, a script is used to pre-calculate the number of blocks to be read at each iteration; this is later added into the source code as a lookup table. Next, a unique index global_block_id is assigned to each block, from 0 to 4095. Finally, a 384-element array is created, and it is accessed via:

    lds[global_block_id % 384];
    

    The trick here is that, as long as:

    1. Local memory is always accessed via global_block_id.
    2. The blocks are loaded linearly into local memory and are accessed in small segments that are not larger than the array size.

    A simple modulus is enough to address every element in local memory; there is no need to explicitly maintain begin and end pointers at runtime. By making the program stateless, the entire headache of runtime ring-buffer management is avoided.
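
    A sketch of how this fits together in OpenCL C, assuming float data and a placeholder iteration count; only the 384-entry pool, the 0-4095 block indices, the pre-computed per-iteration block counts, and the modulus addressing come from the description above:

        #define LDS_BLOCKS     384
        #define NUM_ITERATIONS 64   /* placeholder */

        /* Pre-computed offline by a script and pasted into the source:
           how many new blocks each iteration loads before it computes. */
        __constant int blocks_to_load[NUM_ITERATIONS] = { 0 /* generated */ };

        __kernel void simulate(__global const float *blocks, __global float *out)
        {
            /* One entry per block in this sketch; a real block holds 2x2x2 points. */
            __local float lds[LDS_BLOCKS];

            const int lid = get_local_id(0);
            const int lsz = (int)get_local_size(0);
            int next_block = 0;          /* next global_block_id, 0..4095 */

            for (int iter = 0; iter < NUM_ITERATIONS; ++iter) {
                /* Cooperative load: the modulus maps the monotonically
                   increasing block index onto the fixed-size pool, so no
                   begin/end pointers are maintained at runtime. */
                for (int i = lid; i < blocks_to_load[iter]; i += lsz) {
                    const int global_block_id = next_block + i;
                    lds[global_block_id % LDS_BLOCKS] = blocks[global_block_id];
                }
                next_block += blocks_to_load[iter];
                barrier(CLK_LOCAL_MEM_FENCE);

                /* ... compute phase, also indexing lds via
                   global_block_id % LDS_BLOCKS ... */

                barrier(CLK_LOCAL_MEM_FENCE);
            }
        }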