c# intel-mkl avx2 avx512 mathnet-numerics

MathNet Numerics with Intel MKL runs much slower on an Intel Xeon Gold than on an old i7-7700HQ laptop

I have several functions doing matrix computations using MathNet Numerics + Intel MKL provider. The matrices are not too large, something like 40x100, and the operations involve some pseudoinverses, eigenvalues, and similar linear algebra stuff.

However, having created a small benchmark app to just run the calculation 1000 times, it turns out that our new Intel Xeon Gold 6226R (16 core, 32 threads) runs the calculations 2x slower than my old i7 7700HQ (4 core, 8 thread) laptop, and slower than basically all 5-6 year old PCs I could test.

I have tried both the "native Intel MKL" and the "managed" multi-threaded providers. On the Xeon, MKL is ~2x slower than on the i7, while the managed one is perhaps 10% slower.

Furthermore, when running the test on my i7 laptop, I get ~80% CPU utilization and it runs at ~3.5GHz. The Xeon, however, drops its frequency to ~2GHz and only reaches ~30-40% utilization. Both the laptop (Windows 10) and the server (Windows Server 2019) are set to High Performance mode.

I am pretty sure that there should be some trick I am missing here?

(Update)

In case someone has this same issue, it's possible to disable AVX512 by setting the MKL_ENABLE_INSTRUCTIONS env. variable to AVX2. This speeds up MKL a bit on Xeon:

SET MKL_ENABLE_INSTRUCTIONS=AVX2

But this still doesn't make it as fast as the i7.

The only way the Xeon can beat the i7 is if I disable MKL, use the MathNet Numerics managed provider, and also use a Parallel.For loop to run everything in parallel. In that case, the Xeon is ~30% faster, although admittedly its CPU usage still doesn't get over 40% (the i7 CPU is maxed out).

Still a bit disappointing considering how many extra cores it provides.

Solution

Lots of small matrices probably mean it's hard to scale to many cores without manual parallelization (across calls, not just within one call to an MKL function), which could explain the lower utilization on a 32 vs. 8 logical core machine. 2GHz might be the max AVX-512 frequency on that chip; you'd have to check that, and whether MKL is actually using AVX-512 (see "SIMD instructions lowering CPU frequency"). Also, the row or column sizes might not be an even multiple of a 512-bit vector, which is perhaps not great.
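To make the vector-width point concrete, here is a small sketch of the arithmetic. It assumes the matrices hold 64-bit doubles and that a 40- or 100-element row or column is processed as one contiguous run; the question states neither, and MKL's internal blocking isn't visible from outside. `VectorFit` and its methods are hypothetical names used only for illustration.

```csharp
using System;

static class VectorFit
{
    // Number of 64-bit doubles that fit in one SIMD vector of the given width.
    public static int DoublesPerVector(int vectorBits) => vectorBits / 64;

    // Elements left over after filling as many full vectors as possible.
    public static int Leftover(int length, int vectorBits) =>
        length % DoublesPerVector(vectorBits);

    static void Main()
    {
        foreach (int bits in new[] { 256, 512 })       // AVX2 vs. AVX-512
        {
            foreach (int length in new[] { 40, 100 })  // dimensions from the question
            {
                int full = length / DoublesPerVector(bits);
                Console.WriteLine(
                    $"{bits}-bit, length {length}: {full} full vectors, " +
                    $"{Leftover(length, bits)} doubles left over");
            }
        }
    }
}
```

Under those assumptions, only the 100-element dimension with 512-bit vectors leaves a partial vector (12 full vectors plus 4 doubles); both dimensions divide evenly into 256-bit vectors.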

Your Kaby Lake can run 2x 256-bit FMA per clock, and has the same microarchitecture (inside each core) as your Xeon Gold (Cascade Lake, a Skylake derivative) except for AVX-512 and having 256k L2 instead of 1MiB. Xeon Gold 6226R does have 2x 512-bit FMA units (unlike some Xeon-SP chips), so it is capable of twice the FLOPS per core per clock if it can keep them fed with work.

Intel is always at least several months behind with rolling out server versions of new microarchitectures, and due to their 10nm production problems, we haven't had a really new server microarchitecture until Ice-Lake Xeons this year. Cascade Lake is just an efficiency optimization (power / clocks) on Skylake-X, so it's still just the same cores as client chips from Skylake through Coffee Lake at least, except for AVX-512 (and different extensions thereof) and those larger L2 caches.

But if your workload isn't mostly limited by per-core FMA throughput, and is instead bound by the throughput of other operations like FP division, then it's totally normal that your Kaby Lake is at least as fast per clock, and it clocks higher.

Also, "client" chips like your laptop have lower inter-core latency than the mesh interconnect on big Xeons, which might reduce overhead for small matrices if MKL tries to parallelize problems that are barely worth it. (See "Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?")

Your Xeon should have good aggregate throughput if you do lots of these matrix operations in parallel, one per thread, rather than trying to parallelize one smallish matrix operation after another. That is, don't try to reduce the latency of one operation by distributing its work; instead let one core do all the work for one operation so the data stays hot in its L1d or L2 cache.
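The one-operation-per-thread idea can be sketched roughly like this. It is a minimal illustration, not the asker's benchmark: `BatchRunner`, `RunIndependent`, and `GramTrace` are hypothetical names, and `GramTrace` is only a dependency-free stand-in for the real pseudoinverse/eigenvalue work. In the real application the delegate would call into MathNet Numerics, presumably with the provider's own multithreading turned off so the two levels of parallelism don't compete (that part is an assumption; check the MathNet `Control` class for the setting that applies to your version).

```csharp
using System;
using System.Threading.Tasks;

static class BatchRunner
{
    // Runs `count` independent jobs; each job is executed entirely by one worker.
    // Every result goes into its own slot, so no locking is needed.
    public static double[] RunIndependent(int count, Func<int, double> job)
    {
        var results = new double[count];
        Parallel.For(0, count, i => results[i] = job(i));
        return results;
    }

    // Stand-in workload (assumption): trace of A * A^T for a small matrix
    // whose entries depend on the seed. All data is local to the calling thread.
    public static double GramTrace(int seed, int rows = 40, int cols = 100)
    {
        var a = new double[rows, cols];
        var rng = new Random(seed);
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                a[r, c] = rng.NextDouble();

        double trace = 0.0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                trace += a[r, c] * a[r, c];  // (A * A^T)[r, r] = sum over c of a[r, c]^2
        return trace;
    }

    static void Main()
    {
        double[] results = RunIndependent(1000, i => GramTrace(i));
        Console.WriteLine($"Completed {results.Length} independent operations.");
    }
}
```

Because each job allocates and touches only its own 40x100 matrix, the working set (about 32 KB of doubles) can stay in one core's private cache instead of being shared across the mesh.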