Let's say we have two vectors V and W of length n. I launch a SYCL kernel that performs 3 iterations of a for loop on each entry of V. The loop does the following:

1. First it computes the value of W at an index (W[idx]) from 4 values of V at the current iteration, i.e. W[idx] = V[a] + V[b] + V[c] + V[d], where a, b, c, and d are not contiguous but are fixed for each idx.
2. Then it updates V[idx] based on W[idx]. However, this update of V[idx] must happen only after the current value of V[idx] has been used in step 1 to compute W.

Say the kernel runs 3 iterations of this loop. One thread is in iteration 1 and wants to use V[2] from iteration 1 to compute W[18] in iteration 1. Another thread is already in iteration 2, computes W[2] from its own a, b, c, d, and then updates V[2] in iteration 2.

If the second thread runs ahead of the first, it will overwrite V[2] with the iteration-2 value. When the first thread then wants to read the iteration-1 value of V[2], how do I make sure this is synchronized in SYCL? Will atomic_ref help here, given that the second thread should write V[2] only after the first thread has used it? Note also that the iteration-1 value of V[2] is needed to compute several other W entries in iteration 1, running in other threads. How do I ensure that V[2] is only updated in iteration 2 once its iteration-1 value has been consumed everywhere it is needed? Here is the source code:
```cpp
void jacobi_relaxation(cl::sycl::queue& q, ProblemVar& obj, int current_level) {
  for (int iterations = 1; iterations <= mu1; iterations++) {
    // TODO => v(k+1) = [(1 - omega) x I + omega x D^-1 x (-L-U)] x v(k) + omega x D^-1 x f
    //
    // step 1 => v* = (-L-U) x v
    // step 2 => v* = D^-1 x (v* + f)
    // step 3 => v  = (1 - omega) x v + omega x v*
    q.submit([&](cl::sycl::handler& h) {
      // Accessors for the current_level matrix in CSR format
      auto row = obj.A_sp_dict[current_level].row.get_access<cl::sycl::access::mode::read>(h);
      auto col = obj.A_sp_dict[current_level].col.get_access<cl::sycl::access::mode::read>(h);
      auto val = obj.A_sp_dict[current_level].values.get_access<cl::sycl::access::mode::read>(h);
      auto diag_indices
          = obj.A_sp_dict[current_level].diag_index.get_access<cl::sycl::access::mode::read>(h);
      auto vec = obj.vecs_dict[current_level].get_access<cl::sycl::access::mode::read>(h);
      auto f = obj.b_dict[current_level].get_access<cl::sycl::access::mode::read>(h);
      cl::sycl::accessor<double, 1, cl::sycl::access::mode::write> vec_star{
          obj.temp_dict[current_level], h, cl::sycl::noinit};

      // Require 2 kernels as we perform Jacobi relaxations
      h.parallel_for(
          cl::sycl::range<1>{obj.num_dofs_per_level[current_level]}, [=](cl::sycl::id<1> idx) {
            // double diag_multiplier = 0.0;
            vec_star[idx[0]] = 0.0;
            // step 1 => v* = (-L-U) x v
            for (std::int32_t i = row[idx[0]]; i < row[idx[0] + 1]; i++) {
              vec_star[idx[0]] += -1.0 * val[i] * vec[col[i]];
            }
            // step 2 => v* = D^-1 x (v* + f)
            vec_star[idx[0]] = (1.0 / val[diag_indices[idx[0]]]) * (vec_star[idx[0]] + f[idx[0]])
                               + vec[idx[0]];
          });
    });
    q.wait();

    q.submit([&](cl::sycl::handler& h) {
      // Accessors for the current_level vectors
      auto vec = obj.vecs_dict[current_level].get_access<cl::sycl::access::mode::read_write>(h);
      auto vec_star
          = obj.temp_dict[current_level].get_access<cl::sycl::access::mode::read_write>(h);

      h.parallel_for(cl::sycl::range<1>{obj.num_dofs_per_level[current_level]},
                     [=](cl::sycl::id<1> idx) {
                       // step 3 => v = (1 - omega) x v + omega x v*
                       vec[idx[0]] = (1.0 - omega) * vec[idx[0]] + omega * vec_star[idx[0]];
                       vec_star[idx[0]] = 0.0;
                     });
    });
    q.wait();
  }
}
```
As you can see, for each iteration I am forced to launch 2 kernels so that I can create a synchronization point between the two calculations, and another one at the end of the second calculation. I want to find a way to create only one kernel and perform the iterations inside that kernel, with the synchronization still present.
First of all, it is important to understand the synchronization guarantees that SYCL makes. Like many other heterogeneous models (e.g. OpenCL), SYCL only allows synchronization within a work group, not with work items from other work groups. The background here is that the hardware, driver, or SYCL implementation is not required to execute the work groups in parallel such that they make independent forward progress. Instead, the stack is free to execute the work groups in any order - in an extreme case, it could execute the work groups sequentially, one by one. A simple example is a single-core CPU: there, the backend thread pool of the SYCL implementation probably has size 1, so the implementation might just iterate over all your work groups sequentially.
This means that it is very difficult to formulate producer-consumer algorithms (where one work item produces a value that another work item waits for) that span multiple work groups, because it can always happen that the producer work group is scheduled to run after the consumer work group, potentially leading to a deadlock if the available hardware resources prevent both from running simultaneously.
The canonical way to achieve synchronization across all work items of a kernel is therefore to split the kernel into two kernels, as you have done.
I'm not sure if you did this just for the code example or if it's also in your production code, but I'd like to point out that the `q.wait()` calls between and after the kernels seem unnecessary. `queue::wait()` causes the host thread to wait for completion of all submitted operations, but for this use case it is sufficient to know that the kernels run in order. The SYCL buffer-accessor model already guarantees this: the SYCL implementation detects that both kernels read and write `vec_star`, so a dependency edge is inserted in the SYCL task graph. In general, for performance you want to avoid host synchronization unless absolutely necessary and let the device work through all enqueued work asynchronously.
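For illustration, here is a minimal, self-contained sketch of that point (not your `jacobi_relaxation` code; the buffer name `buf` and the toy updates are invented): two kernels that request access to the same buffer are ordered by the runtime through the task graph, so no `q.wait()` is needed between them, and the host only blocks when the data is read back.

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  cl::sycl::queue q;
  std::vector<double> data(1024, 1.0);
  {
    cl::sycl::buffer<double, 1> buf{data.data(), cl::sycl::range<1>{data.size()}};
    for (int it = 0; it < 3; ++it) {
      // "producer" kernel
      q.submit([&](cl::sycl::handler& h) {
        auto a = buf.get_access<cl::sycl::access::mode::read_write>(h);
        h.parallel_for(cl::sycl::range<1>{data.size()},
                       [=](cl::sycl::id<1> idx) { a[idx] += 1.0; });
      });
      // "consumer" kernel: no q.wait() needed, the runtime orders it after the
      // producer because both kernels request read_write access to buf
      q.submit([&](cl::sycl::handler& h) {
        auto a = buf.get_access<cl::sycl::access::mode::read_write>(h);
        h.parallel_for(cl::sycl::range<1>{data.size()},
                       [=](cl::sycl::id<1> idx) { a[idx] *= 2.0; });
      });
    }
  } // leaving the scope waits for all kernels and copies the result back to data
  return 0;
}
```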
Tricks you can try
In principle, in some special cases you can try other approaches. However, I don't expect them to be a better choice than the two-kernel version for most use cases.
- `group_barrier`: If you somehow manage to formulate the problem so that producer-consumer dependencies never cross the boundary between two work groups, you can use `group_barrier()` for synchronization (a sketch follows below).
- `atomic_ref`: If you somehow know that your SYCL implementation, driver, and hardware all guarantee that your producer work groups execute before or during the consumer work groups, you could keep an atomic flag in global memory that stores whether the value has already been updated, and use `atomic_ref` load/store to implement something like a spin lock in global memory (a sketch follows below).
- Double buffering: Write the updated `vec` into a temporary buffer instead of the original one. After the two kernels have completed, you flip the original and temporary buffers for the next iteration (a sketch follows below).
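A hedged sketch of the `group_barrier` idea, using the SYCL 2020 API. It assumes (which your problem may or may not allow) that for every idx the indices a, b, c, d lie in the same work group as idx, so the producer-consumer dependency never leaves the work group. The function name `relax_within_group`, the buffers V and W, and the toy update are placeholders, not your code.

```cpp
#include <sycl/sycl.hpp>

// Assumption: for each idx, all V entries it reads live in the same work
// group, so a work-group barrier is enough to separate "read old V" from
// "write new V". n is assumed to be a multiple of work_group_size.
void relax_within_group(sycl::queue& q, sycl::buffer<double, 1>& V,
                        sycl::buffer<double, 1>& W,
                        size_t n, size_t work_group_size) {
  q.submit([&](sycl::handler& h) {
    auto v = V.get_access<sycl::access::mode::read_write>(h);
    auto w = W.get_access<sycl::access::mode::read_write>(h);
    h.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{work_group_size}},
        [=](sycl::nd_item<1> it) {
          const size_t idx = it.get_global_id(0);
          for (int iter = 0; iter < 3; ++iter) {
            // step 1: consume the old V values (the real a, b, c, d would
            // depend on idx; here only v[idx] stands in for them)
            w[idx] = v[idx];
            // wait until every work item in this group has read the old V
            sycl::group_barrier(it.get_group());
            // step 2: now it is safe to overwrite V for this group
            v[idx] = w[idx];
            // make the new V visible before the next iteration's reads
            sycl::group_barrier(it.get_group());
          }
        });
  });
}
```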
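A hedged sketch of the `atomic_ref` flag idea, again with the SYCL 2020 API and assuming the device supports acquire/release memory orders. It only works if producers are guaranteed to be scheduled before or alongside consumers; without that forward-progress guarantee the spin loop below can hang. The `flags` buffer, the dependency pattern, and the update are invented for illustration.

```cpp
#include <sycl/sycl.hpp>

// Assumption: the implementation/driver/hardware guarantee that the work item
// publishing V[dep] runs before or concurrently with the one spinning on it;
// otherwise this spin loop can deadlock. 'flags' must be zero-initialized.
void publish_and_consume(sycl::queue& q, sycl::buffer<double, 1>& V,
                         sycl::buffer<double, 1>& W,
                         sycl::buffer<int, 1>& flags, size_t n) {
  q.submit([&](sycl::handler& h) {
    auto v = V.get_access<sycl::access::mode::read_write>(h);
    auto w = W.get_access<sycl::access::mode::write>(h);
    auto f = flags.get_access<sycl::access::mode::read_write>(h);
    h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> id) {
      const size_t idx = id[0];
      auto flag_of = [&](size_t k) {
        return sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                sycl::memory_scope::device,
                                sycl::access::address_space::global_space>{f[k]};
      };
      // producer side: update V[idx], then publish it with a release store
      v[idx] = v[idx] + 1.0;  // placeholder update
      flag_of(idx).store(1, sycl::memory_order::release);
      // consumer side: spin until the value this item depends on has been
      // published, then read it (acquire pairs with the release above)
      const size_t dep = (idx + 1) % n;  // placeholder dependency pattern
      while (flag_of(dep).load(sycl::memory_order::acquire) == 0) {
        // busy-wait; may never terminate without forward-progress guarantees
      }
      w[idx] = v[dep];
    });
  });
}
```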
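A hedged sketch of the double-buffering (ping-pong) variant: the kernel only ever reads the "current" buffer and writes the "next" buffer, so no work item can overwrite a value another work item still needs; after each sweep the host simply swaps the two buffer handles. The function name `relax_ping_pong` and the toy update are placeholders.

```cpp
#include <sycl/sycl.hpp>
#include <utility>

// Ping-pong sketch: read from 'current', write to 'next', then swap the
// handles on the host. The swap copies no data, and the implicit accessor
// dependencies keep the per-iteration kernels in order.
void relax_ping_pong(sycl::queue& q, sycl::buffer<double, 1>& current,
                     sycl::buffer<double, 1>& next, size_t n, int iterations) {
  for (int it = 0; it < iterations; ++it) {
    q.submit([&](sycl::handler& h) {
      auto in  = current.get_access<sycl::access::mode::read>(h);
      auto out = next.get_access<sycl::access::mode::write>(h);
      h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
        // placeholder update: the real step would combine the required
        // neighbour values of 'in' exactly as in the two-kernel version
        out[idx] = 0.5 * in[idx];
      });
    });
    std::swap(current, next);  // the result of this sweep becomes the next input
  }
  // after the loop, 'current' refers to the buffer holding the latest values
}
```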