Tags: multithreading, performance, optimization, openmp

My matrix multiplication program takes four times as long when the thread count doubles


I wrote this simple program that multiplies matrices. The number of OS threads it runs on can be set with the OMP_NUM_THREADS environment variable. It slows down a lot once the thread count exceeds my CPU's hardware thread count.

Here's the relevant part of the program.

    /* DIMENSION is fixed at compile time (1024, 2048, 4096 or 8192 below). */
    static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
        c[DIMENSION][DIMENSION];

    /* Each (i, j) iteration streams through rows c[i] and b[j]. */
    #pragma omp parallel for schedule(static)
      for (unsigned i = 0; i < DIMENSION; i++)
        for (unsigned j = 0; j < DIMENSION; j++)
          for (unsigned k = 0; k < DIMENSION; k++)
            c[i][k] += a[i][j] * b[j][k];
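
For reference, this is roughly how I compile and benchmark it; a minimal harness sketch (the real program differs in details such as how a and b are initialized), built with gcc -O3 -march=native -fopenmp and run with OMP_NUM_THREADS set to the thread count under test:

    #include <stdio.h>
    #include <omp.h>

    #define DIMENSION 4096              /* 1024, 2048, 4096 or 8192 in the tests */

    static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
        c[DIMENSION][DIMENSION];

    int main(void) {
      /* Fill a and b with arbitrary values; the exact contents don't matter. */
      for (unsigned i = 0; i < DIMENSION; i++)
        for (unsigned j = 0; j < DIMENSION; j++) {
          a[i][j] = (double)(i + j);
          b[i][j] = (double)(i ^ j);
        }

      double start = omp_get_wtime();   /* wall-clock time around the kernel */
    #pragma omp parallel for schedule(static)
      for (unsigned i = 0; i < DIMENSION; i++)
        for (unsigned j = 0; j < DIMENSION; j++)
          for (unsigned k = 0; k < DIMENSION; k++)
            c[i][k] += a[i][j] * b[j][k];
      printf("%f seconds\n", omp_get_wtime() - start);
      return 0;
    }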

My CPU is an i7-8750H with 6 cores and 12 hardware threads. When the matrices are large enough, the program is fastest at around 11 threads. It becomes about 4 times as slow by the time the thread count reaches 17, and the run time then stays roughly constant as I keep increasing the thread count.

Here are the results. The top row is DIMENSION, the left column is the thread count, and the times are in seconds. The column marked with * was compiled with -fno-loop-unroll-and-jam.

    threads  1024    2048    4096    4096*   8192
    1        0.2473  3.39    33.80   35.94   272.39
    2        0.1253  2.22    18.35   18.88   141.23
    3        0.0891  1.50    12.64   13.41   100.31
    4        0.0733  1.13    10.34   10.70   82.73
    5        0.0641  0.95    8.20    8.90    62.57
    6        0.0581  0.81    6.97    8.05    53.73
    7        0.0497  0.70    6.11    7.03    95.39
    8        0.0426  0.63    5.28    6.79    81.27
    9        0.0390  0.56    4.67    6.10    77.27
    10       0.0368  0.52    4.49    5.13    55.49
    11       0.0389  0.48    4.40    4.70    60.63
    12       0.0406  0.49    6.25    5.94    68.75
    13       0.0504  0.63    6.81    8.06    114.53
    14       0.0521  0.63    9.17    10.89   170.46
    15       0.0505  0.68    11.46   14.08   230.30
    16       0.0488  0.70    13.03   20.06   241.15
    17       0.0469  0.75    20.67   20.97   245.84
    18       0.0462  0.79    21.82   22.86   247.29
    19       0.0465  0.68    24.04   22.91   249.92
    20       0.0467  0.74    23.65   23.34   247.39
    21       0.0458  1.01    22.93   24.93   248.62
    22       0.0453  0.80    23.11   25.71   251.22
    23       0.0451  1.16    20.24   25.35   255.27
    24       0.0443  1.16    25.58   26.32   253.47
    25       0.0463  1.05    26.04   25.74   255.05
    26       0.0470  1.31    27.76   26.87   253.86
    27       0.0461  1.52    28.69   26.74   256.55
    28       0.0454  1.15    28.47   26.75   256.23
    29       0.0456  1.27    27.05   26.52   256.95
    30       0.0452  1.46    28.86   26.45   258.95

The loop body compiles to the following on gcc 9.3.1 with -O3 -march=native -fopenmp. rax starts at 0 and increases by 64 each iteration; rdx points to c[i], rsi points to b[j], and rdi points to b[j+1], while ymm3 and ymm2 hold the broadcast values a[i][j] and a[i][j+1].

    vmovapd (%rsi,%rax), %ymm1              # ymm1 = b[j][k .. k+3]
    vmovapd 32(%rsi,%rax), %ymm0            # ymm0 = b[j][k+4 .. k+7]
    vfmadd213pd (%rdx,%rax), %ymm3, %ymm1   # ymm1 = a[i][j]*ymm1 + c[i][k .. k+3]
    vfmadd213pd 32(%rdx,%rax), %ymm3, %ymm0 # ymm0 = a[i][j]*ymm0 + c[i][k+4 .. k+7]
    vfmadd231pd (%rdi,%rax), %ymm2, %ymm1   # ymm1 += a[i][j+1]*b[j+1][k .. k+3]
    vfmadd231pd 32(%rdi,%rax), %ymm2, %ymm0 # ymm0 += a[i][j+1]*b[j+1][k+4 .. k+7]
    vmovapd %ymm1, (%rdx,%rax)              # store back into c[i][k .. k+3]
    vmovapd %ymm0, 32(%rdx,%rax)            # store back into c[i][k+4 .. k+7]
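
In source terms, this is roughly what the loop nest looks like after GCC's unroll-and-jam of the j loop by 2 (my reconstruction, not actual compiler output):

    /* Two consecutive j iterations are fused ("jammed") into one pass over the
       k loop, so each pass reads rows b[j] and b[j+1] and updates row c[i]. */
    #pragma omp parallel for schedule(static)
    for (unsigned i = 0; i < DIMENSION; i++)
      for (unsigned j = 0; j < DIMENSION; j += 2)
        for (unsigned k = 0; k < DIMENSION; k++)
          c[i][k] += a[i][j] * b[j][k] + a[i][j + 1] * b[j + 1][k];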

I wonder why the run time increases so much when the thread count increases.

A back-of-the-envelope estimate says this shouldn't be the case when DIMENSION is 4096. Here is my reasoning.

Here is what I thought before I remembered that the compiler processes two j iterations at a time. Each iteration of the j loop needs rows c[i] and b[j], 64 KB in total. My CPU has a 32 KB L1 data cache and a 256 KB L2 cache per core, i.e. per 2 hardware threads. The four rows the two hardware threads are working with don't fit in L1 but fit in L2, so when j advances, c[i] is read back from L2.

When the program runs on 24 OS threads, the number of involuntary context switches is around 29371, and each thread gets interrupted before it has a chance to finish one full pass of the j loop. Since 8 matrix rows fit in the L2 cache, the other software thread's 2 rows are probably still in L2 when it resumes, so the execution time shouldn't be much different from the 12-thread case. The measurements, however, say it's 4 times as slow.
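
(One way to read such a counter is getrusage(2), or /usr/bin/time -v; a minimal sketch:)

    #include <stdio.h>
    #include <sys/resource.h>

    /* Sketch: print the process's involuntary context switch count
       (the figure quoted above) after the computation has finished. */
    static void print_involuntary_switches(void) {
      struct rusage ru;
      if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    }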

Now I have realized that two j iterations are done at a time. This way each iteration of the j loop works on 96 KB of memory (rows c[i], b[j] and b[j+1]), so the working sets of the 4 software threads sharing one L2 cache (384 KB) can't fit in its 256 KB. To verify that this is what slows the program down, I compiled the program with -fno-loop-unroll-and-jam and got (this time in Intel syntax):

    vmovapd ymm0, YMMWORD PTR [rcx+rax]
    vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx+rax]
    vmovapd YMMWORD PTR [rdx+rax], ymm0

The results are in the table above, in the 4096* column. They are about the same as when two b rows are processed at a time, which makes me wonder even more: when DIMENSION is 4096, the 4 software threads' 8 rows fit in the L2 cache when each thread works on 1 b row at a time, but their 12 rows don't fit when each thread works on 2 b rows at a time. Why are the run times similar?

I thought maybe it was because the CPU warmed up while running with fewer threads and had to throttle. I ran the tests multiple times, in both increasing and decreasing order of thread count, and they yield similar results. dmesg also doesn't contain anything related to thermal throttling or clock changes.

I also tried, separately, changing the 4096 columns to 4104 columns and setting OMP_PROC_BIND=true OMP_PLACES=cores, and the results are similar.
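
For clarity, the 4104-column test amounts to padding each row by 8 doubles (64 bytes), so that consecutive rows are no longer exactly 32 KiB, a power of two, apart:

    /* Padded variant used in the 4104-column test (shown for DIMENSION = 4096):
       the 64 extra bytes per row break the power-of-two stride between rows. */
    static double a[DIMENSION][DIMENSION + 8], b[DIMENSION][DIMENSION + 8],
        c[DIMENSION][DIMENSION + 8];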


Solution

  • This problem seems to come from either the CPU caches (due to the bad memory locality) or the OS scheduler (due to running more threads than the hardware can execute simultaneously).

    I cannot reproduce exactly the same effect on my i5-9600KF processor (6 cores, 6 threads) with a 4096x4096 matrix, but similar effects occur.

    Here are performance results (with GCC 9.3 using -O3 -march=native -fopenmp on Linux 5.6):

    #threads | time (in seconds)
    ----------------------------
           1 | 16.726885
           2 | 9.062372
           3 | 6.397651
           4 | 5.494580
           5 | 4.054391
           6 | 5.724844  <-- maximum number of hardware threads
           7 | 6.113844
           8 | 7.351382
           9 | 8.992128
          10 | 10.789389
          11 | 10.993626
          12 | 11.099117
          24 | 11.283873
          48 | 11.412288
    

    We can see that the computation time starts to grow significantly between 5 and 12 threads.

    This problem is due to much more data being fetched from RAM. Indeed, 161.6 GiB are loaded from memory with 6 threads, while 424.7 GiB are loaded with 12 threads! In both cases, 3.3 GiB are written to RAM. Since my memory throughput is roughly 40 GiB/s, the RAM accesses account for more than 96% of the overall execution time with 12 threads (424.7 GiB / 40 GiB/s ≈ 10.6 s out of 11.1 s).

    If we dig deeper, we can see that the number of L1 cache references and L1 cache misses is the same whatever the number of threads used. Meanwhile, there are a lot more L3 cache misses (as well as more references). Here are the L3 cache statistics:

    With 6 threads:     4.4 G loads
                        1.1 G load-misses  (25% of all LL-cache hits)
    
    With 12 threads:    6.1 G loads
                        4.5 G load-misses  (74% of all LL-cache hits)
    

    This means that the locality of the memory accesses is clearly worse with more threads. I guess this is because the compiler is not clever enough to perform high-level cache-aware optimizations that could reduce RAM pressure (especially when the number of threads is high). You have to do the tiling yourself in order to improve memory locality, as in the sketch below. You can find a good guide here.
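
    As an illustration, here is a sketch of a tiled version of the loop nest from the question (BLOCK is an arbitrary, untuned value, and DIMENSION is assumed to be a multiple of BLOCK):

        /* Sketch of loop tiling (blocking) applied to the kernel from the
           question.  Each (ii, jj, kk) step works on BLOCK x BLOCK tiles of
           a, b and c, so the data reused by the inner loops stays in cache.
           BLOCK is a placeholder and should be tuned for the target CPU;
           DIMENSION must be a multiple of BLOCK in this simple version.   */
        #define BLOCK 64

        #pragma omp parallel for schedule(static)
        for (unsigned ii = 0; ii < DIMENSION; ii += BLOCK)
          for (unsigned jj = 0; jj < DIMENSION; jj += BLOCK)
            for (unsigned kk = 0; kk < DIMENSION; kk += BLOCK)
              for (unsigned i = ii; i < ii + BLOCK; i++)
                for (unsigned j = jj; j < jj + BLOCK; j++)
                  for (unsigned k = kk; k < kk + BLOCK; k++)
                    c[i][k] += a[i][j] * b[j][k];

    With BLOCK = 64, a 64x64 tile of doubles is 32 KiB, so the three tiles being worked on fit comfortably in the 256 KiB L2 cache.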

    Finally, note that using more threads than the hardware can execute simultaneously is generally not efficient. One problem is that the OS scheduler often places threads on cores poorly and frequently moves them between cores. The usual way to fix that is to bind software threads to hardware threads by setting OMP_PROC_BIND=TRUE and the OMP_PLACES environment variable, as sketched below. Another problem is that the threads are executed with preemptive multitasking while sharing resources (e.g. caches).
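
    For completeness, the thread-count cap and the binding request can also be expressed in the code itself. Here is a sketch using the standard OpenMP API (omp_get_num_procs, omp_get_max_threads, omp_set_num_threads and the proc_bind clause), reusing DIMENSION and the arrays from the question:

        #include <omp.h>

        /* Sketch: never run more threads than logical CPUs, and ask for a
           "close" binding policy from the source (OMP_PLACES can still be
           used to refine the placement). */
        void multiply_bound(void) {
          int hw = omp_get_num_procs();        /* number of logical CPUs  */
          if (omp_get_max_threads() > hw)
            omp_set_num_threads(hw);           /* avoid oversubscription  */

        #pragma omp parallel for schedule(static) proc_bind(close)
          for (unsigned i = 0; i < DIMENSION; i++)
            for (unsigned j = 0; j < DIMENSION; j++)
              for (unsigned k = 0; k < DIMENSION; k++)
                c[i][k] += a[i][j] * b[j][k];
        }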


    PS: please note that BLAS libraries (e.g. OpenBLAS, BLIS, Intel MKL, etc.) are much more optimized than this code, as most of them already include clever optimizations such as manual vectorization for the target hardware, loop unrolling, multithreading, tiling and fast matrix transposition when needed. For a 4096x4096 matrix, they are about 10 times faster.