Tags: c, openmp, simd, data-race

Data race in parallelized nested loop


I have a triple-nested loop that I would like to parallelize; however, I am getting a data race. I am pretty sure that I need to use a reduction somehow, but I don't quite know how.

This is the loop in question:

#pragma omp parallel for simd collapse(3)
    for (uint64 u = 0; u < nu; ++u) {
        for (uint64 e = 0; e < ne; ++e) {
            for (uint64 v = 0; v < nv; ++v) {
                uAT[u][e] += _uT[u][e][v] * wA[e][v];
            }
        }
    }

Could someone explain to me why this causes a data race? I would really like to understand this so that I don't run into these issues in the future. Also, can this loop be parallelized at all? If so, how?

EDIT: How do I know that there is a data race?

What this loop should accomplish (and it does in serial) is to compute the element average of a function in a Discontinuous Galerkin framework. When I run the code a bunch of times, I sometimes get different results, even though it should always produce the same result. The wrong values are always smaller than they should be, which is why I assume that some values are not being added. Maybe this picture explains it better: the average in the third cell is obviously wrong (too small).


Solution

  • Original answer concerning multi-threading, not SIMD

    By using the collapse(3) clause, the whole iteration space of nu * ne * nv iterations is distributed among threads. This means that for any combination of u and e, the v loop could be distributed among multiple threads as well. These can then access the same element uAT[u][e] in parallel, which is the data race.
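
    To make the flattening concrete, here is a small standalone sketch (hypothetical sizes, no OpenMP needed) that prints which output element each flattened iteration would update:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const uint64_t nu = 2, ne = 2, nv = 4;
        // collapse(3) turns the nest into one logical index i in [0, nu*ne*nv)
        // and distributes chunks of i among threads. Recover (u, e, v) from i:
        for (uint64_t i = 0; i < nu * ne * nv; ++i) {
            uint64_t u = i / (ne * nv);
            uint64_t e = (i / nv) % ne;
            uint64_t v = i % nv;
            // Consecutive i usually differ only in v, i.e. they target the same
            // uAT[u][e]. If they land on different threads, both perform an
            // unsynchronized read-modify-write of that element: the data race.
            printf("i=%2" PRIu64 " updates uAT[%" PRIu64 "][%" PRIu64 "] (v=%" PRIu64 ")\n",
                   i, u, e, v);
        }
        return 0;
    }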

    As long as nu * ne is much bigger than the number of CPU cores that you work on, the easiest solution is to use collapse(2) instead. As there can be inefficient implementations of collapse (see here), you might even want to leave it out completely, provided nu on its own is big enough.
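
    A minimal sketch of that fix (the loop from the question, keeping the rest of your pragma as it is; only collapse(3) becomes collapse(2)):

    // Each (u, e) pair is now handled by exactly one thread, so all updates
    // to uAT[u][e] stay within that thread and the race is gone.
    #pragma omp parallel for simd collapse(2)
    for (uint64 u = 0; u < nu; ++u) {
        for (uint64 e = 0; e < ne; ++e) {
            for (uint64 v = 0; v < nv; ++v) {
                uAT[u][e] += _uT[u][e][v] * wA[e][v];
            }
        }
    }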

    If you really need the parallelism from all three loops to efficiently use your hardware, you can either use reduction(+: uAT[0:nu][0:ne]) (add it to your existing pragma) or put a #pragma omp atomic update in front of uAT[u][e] += ...;. Which of these options is faster should be benchmarked. The reduction clause will use a lot more memory, because every thread gets its own private copy of the whole memory addressed through uAT. On the other hand, the atomic update could in the worst case sequentialize part of your parallel work and give worse performance than using collapse(2).
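
    For reference, hedged sketches of both options (I leave out the simd clause here, since this part of the answer is only about multi-threading; the array-section reduction needs OpenMP 4.5 or newer and assumes uAT can be expressed as such a section):

    // Option 1: array reduction -- every thread accumulates into a private
    // copy of uAT, and the private copies are summed up at the end.
    #pragma omp parallel for collapse(3) reduction(+: uAT[0:nu][0:ne])
    for (uint64 u = 0; u < nu; ++u) {
        for (uint64 e = 0; e < ne; ++e) {
            for (uint64 v = 0; v < nv; ++v) {
                uAT[u][e] += _uT[u][e][v] * wA[e][v];
            }
        }
    }

    // Option 2: make every read-modify-write of uAT[u][e] atomic instead.
    #pragma omp parallel for collapse(3)
    for (uint64 u = 0; u < nu; ++u) {
        for (uint64 e = 0; e < ne; ++e) {
            for (uint64 v = 0; v < nv; ++v) {
                #pragma omp atomic update
                uAT[u][e] += _uT[u][e][v] * wA[e][v];
            }
        }
    }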

    Edit 1: SIMD

    I just saw that you are also concerned with SIMD, not just multi-threading; my original answer is about the latter. For SIMD, I would first take a look at your compiler's output without any OpenMP directives (e.g. on Compiler Explorer). If your compiler (with optimization enabled and information about the target processor architecture) already uses SIMD instructions, you might not need OpenMP in the first place.

    If you still want to use OpenMP for vectorizing your loops, leaving out the collapse clause and putting the pragma in front of the second loop would be my first approach (see the sketch below). Leaving it in front of the first loop with collapse(2) might also work. Even a reduction should work, but seems unnecessarily complex in this context, as I would expect there to be enough parallelism in nu or nu * ne to fill your SIMD lanes. I have never used an array reduction as described above in the SIMD context, so I'm not quite sure what it would do (i.e. allocating an array for each SIMD lane doesn't seem realistic), or if it even is part of the OpenMP standard (that depends on the version of the standard, see here).
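
    For concreteness, this is what I mean by that first approach (a sketch; whether it actually produces efficient vector code has to be checked in the generated assembly):

    for (uint64 u = 0; u < nu; ++u) {
        // pragma moved from the collapsed nest to the e loop, no collapse
        #pragma omp parallel for simd
        for (uint64 e = 0; e < ne; ++e) {
            for (uint64 v = 0; v < nv; ++v) {
                uAT[u][e] += _uT[u][e][v] * wA[e][v];
            }
        }
    }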

    The way your code is written right now, the description of the data race for multi-threading technically still applies, I think. I'm not sure, though, whether your code gets (efficiently) vectorized at all; if it doesn't, the compiled binary might not actually contain the data race.

    Edit 2: Threading + SIMD

    I benchmarked a few versions of this loop nest for nu = 4, nv = 128 and ne between 1024 and 524288 (2^19). My benchmarking was done using google-benchmark (i.e. C++ instead of C, which shouldn't matter here, I would think) with gcc 11.3 (always using -march=native -mtune=native, so the numbers may not be representative of portable performance). I made sure to initialize all data in parallel (first-touch policy) to avoid bad NUMA effects. I slightly modified the problem/code to use contiguous memory and do the multi-dimensional indexing manually. As the OP's code didn't show the data type, I used float.
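
    For illustration, a C sketch of that setup (a hypothetical helper, not the actual benchmark code, which was C++):

    #include <stdint.h>
    #include <stdlib.h>

    // Flat, contiguous buffers with manual indexing:
    //   uT[u][e][v] -> uT[(u * ne + e) * nv + v]
    //   wA[e][v]    -> wA[e * nv + v]
    //   uAT[u][e]   -> uAT[u * ne + e]
    // First-touch initialization: pages are touched in parallel so that they
    // end up close to the threads that will later work on them (NUMA).
    static float *alloc_first_touch(uint64_t n) {
      float *p = malloc(n * sizeof *p);
      #pragma omp parallel for
      for (uint64_t i = 0UL; i < n; ++i) {
        p[i] = 0.0f;  // or fill with the actual input data in parallel
      }
      return p;
    }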

    The three best versions I will share here all perform relatively similarly, so which one is best might differ depending on the hardware architecture. For me, the version inspired by Peter Cordes' comments below this answer performed best:

    #pragma omp parallel for
    for (uint64_t e = 0UL; e < ne; ++e) {
      // In the future we will get `#pragma omp unroll partial(4)`.
      // The unrolling might actually not be necessary (or even a pessimization).
      // So maybe leave it to the compiler.
      #pragma GCC unroll 4
      for (uint64_t u = 0UL; u < nu; ++u) {
        float temp = 0.0f;
        #pragma omp simd reduction(+ : temp)
        for (uint64_t v = 0UL; v < nv; ++v) {
          temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
        uAT[u * ne + e] += temp;
      }
    }
    

    One could also have a float temp[nu]; and put the unrolled u loop inside the v loop to get even nearer to Peter Cordes' description, but then one would have to use an array reduction as described above (see the sketch below). These array reductions consistently caused stack overflows for me, so I settled on this version, which depends on nv being small enough that wA can still be cached between u iterations.
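
    For completeness, a sketch of that variant (nu is hard-coded to the benchmark value of 4 here; this is the version whose array reduction consistently overflowed the stack for me):

    #pragma omp parallel for
    for (uint64_t e = 0UL; e < ne; ++e) {
      float temp[4] = {0.0f};  // one accumulator per u, assumes nu == 4
      #pragma omp simd reduction(+ : temp)
      for (uint64_t v = 0UL; v < nv; ++v) {
        #pragma GCC unroll 4
        for (uint64_t u = 0UL; u < nu; ++u) {
          temp[u] += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
      }
      for (uint64_t u = 0UL; u < nu; ++u) {
        uAT[u * ne + e] += temp[u];
      }
    }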

    The second version differs from the first only in the u loop staying on the outside:

    #pragma omp parallel
    for (uint64_t u = 0UL; u < nu; ++u) {
      #pragma omp for
      for (uint64_t e = 0UL; e < ne; ++e) {
        float temp = 0.0f;
        #pragma omp simd reduction(+ : temp)
        for (uint64_t v = 0UL; v < nv; ++v) {
          temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
        uAT[u * ne + e] += temp;
      }
    }
    

    The third version uses collapse(2):

    #pragma omp parallel for collapse(2)
    for (uint64_t u = 0UL; u < nu; ++u) {
      for (uint64_t e = 0UL; e < ne; ++e) {
        float temp = 0.0f;
        #pragma omp simd reduction(+ : temp)
        for (uint64_t v = 0UL; v < nv; ++v) {
          temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
        uAT[u * ne + e] += temp;
      }
    }
    

    TL;DR:

    I think the most important points can be seen in all three versions:

    • #pragma omp parallel for simd is nice for big simple loops. If you have a loop nest, you probably should split the pragma up.
    • Use simd on a loop which accesses contiguous elements in contiguous iterations.
    • By following the first two points, you don't need the possibly expensive array reduction.
    • By using a temporary (i.e. a register) for the reduction instead of writing back to memory, you make using OpenMP easier and probably get better performance in serial code as well. Due to floating-point non-associativity, most compilers can't do this optimization on their own unless you allow it via e.g. -ffast-math (gcc).
    • Most compilers cannot vectorize the reduction on their own for the same reason.
    • Details like using collapse or the order of the e and u loops are smaller optimizations which are not as important as long as you provide enough parallelism, i.e. don't parallelize over just u if nu is small (as the question author wrote below this answer).