
OpenMP: performance decreases with multiple threads on my desktop, but improves on my server


I have read quite a few StackOverflow threads about performance decreasing as the number of OpenMP threads increases. Most of those cases were caused by false sharing. My situation is different because my two machines show opposite results.

I'm running the STREAM benchmark, and my machines are described below:

  • Intel Xeon Gold-6148. 20 cores (40 threads), 2.4 GHz, 27.5MB LLC
  • Intel Core i5-9400. 6 cores (6 threads), 2.9 GHz, 9MB LLC

Memory information is similar between the two machines so I will omit it.

I ran the benchmark many times and checked that the variance between runs is small enough. The result is quite interesting.

The Gold-6148 (a.k.a. the server) gets better results as I increase the number of threads with the OMP_NUM_THREADS option. However, the results of the i5-9400 (a.k.a. the desktop) get worse with more threads.

I set STREAM_ARRAY_SIZE to 20m and double-checked with various other sizes, so the array size should not be affecting the result. I also suspected a difference in the glibc/gomp libraries between the two machines, but found none.
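As a quick sanity check on the array size, the arithmetic can be sketched as follows. This is only a sketch under stated assumptions: it reads "20m" as 20 million elements, assumes STREAM's default element type (double, 8 bytes), and uses the guideline from the STREAM source comments that each array should be at least 4 times the size of the available cache. The function names are made up for illustration.

```python
ELEMENT_BYTES = 8            # STREAM's default element type is double
ARRAY_ELEMENTS = 20_000_000  # assumption: "20m" means 20 million elements


def array_mb(elements, element_bytes=ELEMENT_BYTES):
    """Size of one STREAM array in MB (1 MB = 1e6 bytes)."""
    return elements * element_bytes / 1e6


def meets_guideline(elements, llc_mb):
    """True if one array is at least 4x the last-level cache size."""
    return array_mb(elements) >= 4 * llc_mb


if __name__ == "__main__":
    # LLC sizes as listed for the two machines in the question.
    for name, llc_mb in [("Gold-6148", 27.5), ("i5-9400", 9.0)]:
        ok = meets_guideline(ARRAY_ELEMENTS, llc_mb)
        print(f"{name}: array = {array_mb(ARRAY_ELEMENTS):.0f} MB, "
              f"4 x LLC = {4 * llc_mb:.0f} MB, large enough: {ok}")
```

Under these assumptions each array is well beyond the cache on both machines, which is consistent with the observation that changing the array size did not change the trend.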

Any idea why this is happening? I just keep absentmindedly staring at the charts below again and again...

[Chart: i5-9400 result]
[Chart: Gold-6148 result]


Solution

  • Memory information is similar between the two machines so I will omit it.

    You cannot simply omit the most important information when talking about the STREAM benchmark. The Xeon Gold 6148 has six DDR4-2666 memory channels split over two separate memory controllers, while the i5-9400 (assuming i5-9700 is a typo, since the i7-9700 is an octa-core i7 and not a hexa-core i5 CPU) has only two DDR4-2666 memory channels on a single memory controller. Therefore, a 6148 with memory modules installed on all six channels is capable of delivering 3x the memory bandwidth of an i5-9400 with the same type of memory modules on both channels. It can also handle more simultaneous memory requests and therefore achieves better memory utilisation with more than one thread. Thus, the actual memory configuration is quite important.
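    The 3x figure follows from simple arithmetic: theoretical peak bandwidth is the number of channels times the transfer rate times the 8-byte (64-bit) width of a DDR4 channel. A minimal sketch, assuming every channel is populated with DDR4-2666 and counting a single socket only (the helper name is hypothetical, and real STREAM numbers will be lower than these peaks):

```python
BUS_WIDTH_BYTES = 8  # one DDR4 channel has a 64-bit data bus


def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * 1e6 * BUS_WIDTH_BYTES / 1e9


# Assumption: all channels populated with DDR4-2666, one socket.
server_gbs = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
desktop_gbs = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=2666)

if __name__ == "__main__":
    print(f"Xeon Gold 6148: {server_gbs:.1f} GB/s")
    print(f"i5-9400:        {desktop_gbs:.1f} GB/s")
    print(f"ratio:          {server_gbs / desktop_gbs:.1f}x")
```

    With only about 43 GB/s available in theory, it is plausible that a single thread or very few threads already saturate the desktop's memory, whereas the server has much more headroom left for additional threads.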

    Interpreting the STREAM results requires a deep understanding of the underlying CPU architecture. There is a nice article by Georg Hager on that topic.