Search code examples
optimizationparallel-processingopenmphpcfalse-sharing

How minimize "false sharing" in this array processing with OpenMP?


In a C program a three-dimensional grid (an array) is processed with three nested for loops. The updated data at each grid cell (array element) is a function of some data on the current cell and some data on neighbor cells. I tried to use OpenMP to make the processing parallel by adding the following directive before the nested loops:

#pragma omp parallel for shared(x,y) default(none) collapse(3) num_threads(numth)

But the performance increases just slightly.

I guess one culprit may be false sharing. The collapse and schedule may not be dividing the cells among threads in the best way. Assuming that the array is two-dimensional and there are four threads, it seems to me that the following image shows the best way to parallelize the array processing (each color for a separate thread). But what is the easiest way to achieve this? For example, is there a way to change the default behavior of the collapse clause?

enter image description here

Some further details:

  1. The function containing the nested loops is a complicated one. So it may not be a good idea to copy-paste it here.
  2. The number of cells of the grid is typically not very large (< 100,000) but the processing algorithm is complex and computationally expensive.
  3. With MPI the speedup is very good. The basic algorithm used for MPI version is similar to what the image above shows. The difference, apart from inter-process communications, is that a separate process (instead of a separate thread) is responsible for each color.

Solution

  • I guess one culprit may be false sharing

    This unlikely to be the performance issue. Indeed, in practice, mainstream OpenMP implementation (e.g. GCC, Clang and certainly ICC) will parallelize the outermost loop. This means there is almost no false-sharing. In fact, stencil codes generally operate on different input/output arrays so to avoid race condition and complex tile scheduling, so false sharing can only happen on the output buffer and it should be clearly negligible for large arrays. Indeed, a cache line can be shared by two cores without a performance decrease on all mainstream processors. It is only when there are writes by (at least) two cores on the same shared cache-line that performance issue happens (false-sharing). This is due to the invalidation of the target cache-line (shared by other cores) during the write.

    The collapse may not be dividing the cells among threads in the best way.

    It may not be cache friendly (for large arrays), but it is good regarding false-sharing. The image you show certainly improves cache locality, but strongly increase the amount of invalidated cache-lines due to false-sharing. Indeed, two cores share the same line so 1 cache line invalidation can happen per line. This is far more than few invalidation for the whole array.

    Note that the false-sharing issue is not a problem anymore if each tile is stored contiguously in memory. That being said, such a memory layout makes the stencil much more complex to implement. Besides, it only worth it if the arrays are sufficiently large.

    Overall, if your stencil kernel is small, it is generally not a good idea to compute the contiguous dimension using multiple threads (except maybe if this dimension is very large -- too large so multiple lines does not fit in the L1 cache).

    The point is stencil codes tends to be memory-bound. This is partially because their implementation is often not very cache friendly (tiling is the key to improve that). But this is mainly because the arithmetic intensity of stencils is small (especially for small kernels). Thus they tends to be bound by the RAM. As a result having more cores does not improve much the execution time because the RAM bandwidth is already saturated (or close to be).

    One way to solve this problem, assuming you apply many repeated iteration of the same stencil, is to perform the operation in (contiguous) tiles and apply a group of multiple iterations of the stencil on each tile so to. The perfect tile shape might not be a square nor even a rectangle. Researchers work on this topic for decade so to improve performance of naive stencil codes. In the end, it turns out that writing an efficient stencil is actually pretty hard. There are tools (e.g. Pochoir) to generate efficient stencil code (typically based on the polyhedral model). Here is few research papers about the subject:

    But what is the easiest way to achieve this? For example, is there a way to change the default behaviour of the collapse clause?

    If you still want to do tiling the way describe in the picture (which is not efficient without changing the memory layout), then you can introduce 3 additional loops for the tiles depending of the number of thread used. The thing is, this is far from being so simple when the number of core is not a power of two (e.g. 6, 10, and 24 cores are quite common theses days, especially with big-little CPUs like recent Intel ones). Note that recent version of OpenMP introduces new tiling directives (but I am not sure they are implemented in mainstream compilers so far).


    Notes about NUMA architectures

    Server CPUs and AMD Zen ones tends to have multiple NUMA nodes. This means that if your code is not NUMA aware, it may only use a fraction of the memory bandwidth (typically the RAM which tends to be the bottleneck). For example, on Linux, the default page placement policy is the first touch policy: the first write in memory page causes the page to be mapped on the memory associated to the NUMA node which made the write access. This means that if you initialize the stencil arrays sequentially, then only one node will be used and all NUMA nodes will saturate it during the parallel stencil execution (others will be unused which is a waste of resources). The default behaviour can be overwritten using numactl. Alternatively, you can make your application NUMA-aware typically by initializing arrays in parallel using the same data distribution than for the computation.

    Besides, note that a dynamic scheduling can be significantly more expensive than a static one on such an architecture because of a possible memory throughput imbalance due to the data distribution on the different NUMA nodes (also because the overhead of the CPU interconnect which can be saturated too, or even due to the increased latency in some use-cases).


    Additional notes

    Note that low-level profiler are critical to validate guesses. The hypotheses done in this post might be wrong. The only way to be sure is to profile the memory throughput so to know if the code is memory bound, and count the number of cache misses (due to write invalidation) so to know if there is some significant false sharing overhead. You can get both thanks to hardware performance counters though it can be quite difficult to extract the right values for novice developers. High-level HPC profiling tools can help you in this case (AFAIK, some can generate a roofline plot.