rendering openmp renderer hyperthreading slowdown

Hyper-threading... made my renderer 10 times slower

Executive summary: How can one specify in his code that OpenMP should only use threads for the REAL cores, i.e. not count the hyper-threading ones?

Detailed analysis: Over the years, I've coded a SW-only, open source renderer (rasterizer/raytracer) in my free time. The GPL code and Windows binaries are available from here: https://www.thanassis.space/renderer.html It compiles and runs fine under Windows, Linux, OS/X and the BSDs.

I introduced a raytracing mode this last month - and the quality of the generated pictures sky-rocketed. Unfortunately, raytracing is orders of magnitude slower than rasterizing. To increase speed, just as I did for the rasterizers, I added OpenMP (and TBB) support to the raytracer - to easily make use of additional CPU cores. Both rasterizing and raytracing are easily amenable to threading (work per triangle - work per pixel).

At home, with my Core2Duo, the 2nd core helped all the modes - both the rasterizing and the raytracing modes got a speedup that is between 1.85x and 1.9x.

The problem: Naturally, I was curious to see the top CPU performance (I also "play" with GPUs, preliminary CUDA port), so I wanted a solid base for comparisons. I gave the code to a good friend of mine, who has access to a "beast" machine, with a 16-core, 1500$ Intel super processor.

He runs it in the "heaviest" mode, the raytracer mode...

...and he gets one fifth the speed of my Core2Duo (!)

Gasp - horror. What just happened?

We started trying different modifications, patches, ... and eventually we figured it out.

By using the OMP_NUM_THREADS environment variable, one can control how many OpenMP threads are spawned. As the number of threads was increasing from 1 up to 8, the speed was increasing (close to a linear increase). The moment we crossed 8, speed started to diminish, until it nose-dived to one fifth the speed of my Core2Duo, when all 16 cores were used!

Why 8?

Because 8 was the number of the real cores. The other 8 were... hyperthreading ones!

The theory: Now, this was news to me - I've seen hyper-threading help a lot (up to 25%) in other algorithms, so this was unexpected. Apparently, even though each hyper-threading core comes with its own registers (and SSE unit?), the raytracer could not make use of the extra processing power. Which lead me to think...

It is probably not processing power that is starved - it is memory bandwidth.

The raytracer uses a bounding volume hierarchy data structure, to accelerate ray-triangle intersections. If the hyperthreaded cores are used, then each of the "logical cores" in a pair, is trying to read from different places in that data structure (i.e. in memory) - and the CPU caches (local per pair) are completely thrashed. At least, that's my theory - any suggestions most welcome.

So, the question: OpenMP detects the number of "cores" and spawns threads to match it - that is, it includes the hyperthreaded "cores" in the calculation. In my case, this apparently leads to disastrous results, speed-wise. Does anyone know how to use the OpenMP API (if possible, portably) to only spawn threads for the REAL cores, and not the hyperthreaded ones?

P.S. The code is open (GPL) and available at the link above, feel free to reproduce on your own machine - I am guessing this will happen in all hyperthreaded CPUs.

P.P.S. Excuse the length of the post, I thought it was an educational experience and wanted to share.

Solution

Basically, you need some fairly portable way of querying the environment for fairly low-level hardware details - and generally, you can't do that from just system calls (the OS is generally unaware even of the difference between hardware threads and cores).

One library which supports a number of platforms is hwloc - supports Linux & windows (and others), intel & amd chips. Hwloc will let you find everything out about the hardware topology, and knows the difference between cores and hardware threads (called PUs - processing units - in hwloc terminology). So you'd call this library at the start, find the number of actual cores, and call omp_set_num_threads() (or just add that variable as a directive at the start of parallel sections).