I'm getting unpredictable errors in my program when running with a lot of threads and am trying to find where the problem sits by inserting critical sections. Now I've got to the following
void A::func()
{
#pragma omp parallel for
for (int i=0; i<100; ++i)
{
#pragma omp critical
results[i] = compute(i);
}
}
This still gives me the problem that I'm trying to find. The only way I can get it to disappear is by removing the parallelisation of the for
loop (not very desirable!).
Note that the compute
function is complicated and actually uses some caches. I wouldn't be surprised if there is some problem in that code, but as the whole thing is now in a critical section, I can't understand what's going on.
The actual for loop that I'm trying to debug is at https://github.com/UCL/STIR/blob/49c65340a2116834fea6a9a5cb17f73eb4a6f259/src/scatter_buildblock/ScatterSimulation.cxx#L179-L198. The test version that still has the problem (close to the example above) is here
Note: I had first posted this in a more complicated scenario but I've deleted that post as 1) my simple example there doesn't exhibit the problem and 2) this scenario is even more puzzling.
Some more info added. I'm trying this on Windows 10 with a Ryzen 9 5900. My test code shows occasional problems with gcc-8 and gcc-9 on WSL2/Ubuntu 20.04. (I observe no problems with VS 2019 on windows nor clang-13 on WSL2/Ubuntu 20.04, but that doesn't mean a lot I guess).
This is the code snippet in question from github.
#ifdef STIR_OPENMP
#pragma omp parallel for reduction(+:total_scatter) schedule(dynamic)
#endif
for (int i = 0; i < static_cast<int>(all_bins.size()); ++i)
{
const Bin bin = all_bins[i];
const double scatter_ratio = scatter_estimate(bin);
#if defined STIR_OPENMP
# if _OPENMP >= 201107
# pragma omp atomic write
# else
# pragma omp critical(ScatterSimulationByBin_process_data_for_view_segment_num)
# endif
#endif
viewgram[bin.axial_pos_num()][bin.tangential_pos_num()] =
static_cast<float>(scatter_ratio);
total_scatter += static_cast<double>(scatter_ratio);
} // end loop over bins
These are the possible causes of your problem:
scatter_estimate
is not threadsafe. Using #pragma omp critical
before the function would have solved it if this is the case.bin.axial_pos_num()
or bin.tangential_pos_num()
is not threadsafe. Please check these as well.i
gives the same indices (i.e. bin.axial_pos_num()
and bin.tangential_pos_num()
) then different execution order may give different result, because you overwrite the same element of viewgram
array in different order.
Note that, if this is the case it may be a flaw in your algorithm.