Tags: performance, x86, cpu-architecture, hpc, avx

Does anyone have an example where _mm256_stream_load_si256 (a non-temporal load that bypasses the cache) actually improves performance?


Consider massively SIMD-vectorized loops over very large amounts of floating-point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming", i.e. cache-bypassing) loads/stores.

Using a non-temporal store (_mm256_stream_ps) does significantly improve throughput, by about 25% over a plain store (_mm256_store_ps).

However, I could not measure any difference when using _mm256_stream_load_si256 instead of _mm256_load_ps.

Does anyone have an example where _mm256_stream_load_si256 can be used to actually improve performance?

(Instruction set & hardware: AVX2 on AMD Zen2, 64 cores)

for (size_t i = 0; i < 1000000000 /* larger than L3 cache size */; i += 8)
{
  #ifdef USE_STREAM_LOAD
  __m256 a = _mm256_castsi256_ps(_mm256_stream_load_si256((__m256i *)(source + i)));
  #else
  __m256 a = _mm256_load_ps(source + i);
  #endif

  a = _mm256_mul_ps(a, a);   // a *= a, written portably with intrinsics

  #ifdef USE_STREAM_STORE
  _mm256_stream_ps(destination + i, a);
  #else
  _mm256_store_ps(destination + i, a);
  #endif
}

Solution

  • stream_load (vmovntdqa) is just a slower version of normal load (extra ALU uop) unless you use it on a WC memory region (uncacheable, write-combining).

    The non-temporal hint is ignored by current CPUs, because unlike NT stores, the instruction doesn't override the memory ordering semantics. We know that's true on Intel CPUs, and your test results suggest the same is true on AMD.

    Its purpose is for copying from video RAM back to main memory, as in an Intel whitepaper. On current CPUs it's useless unless you're copying from some kind of uncacheable device memory.

    See also What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region? for more details. As my answer there points out, what can sometimes help, if tuned carefully for your hardware and workload, is NT prefetch to reduce cache pollution. But tuning the prefetch distance is pretty brittle: too far and the data will be fully evicted by the time you read it, instead of just missing L1 and hitting in L2.
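
    A minimal sketch of that idea, assuming a made-up PREFETCH_DISTANCE (a placeholder that needs per-machine tuning, not a known-good value):

    #include <immintrin.h>
    #include <stddef.h>

    void square_with_nt_prefetch(float *destination, const float *source, size_t n)
    {
        const size_t PREFETCH_DISTANCE = 512;   // in floats; tune for your hardware and workload
        for (size_t i = 0; i < n; i += 8)
        {
            // prefetchnta: hint to bring the line in while minimizing cache pollution
            _mm_prefetch((const char *)(source + i + PREFETCH_DISTANCE), _MM_HINT_NTA);
            __m256 a = _mm256_load_ps(source + i);   // source assumed 32-byte aligned, n a multiple of 8
            a = _mm256_mul_ps(a, a);
            _mm256_store_ps(destination + i, a);
        }
    }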

    There wouldn't be much, if anything, to gain in bandwidth anyway. Normal stores cost a read plus an eventual write-back on eviction for each cache line. The Read For Ownership (RFO) is required for cache coherency, and because write-back caches only track dirty status on a whole-line basis. NT stores can increase bandwidth by avoiding those RFO reads.
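
    For example (a minimal sketch, not from the question): filling a large array with NT stores writes each destination line without reading it first, whereas plain stores would trigger an RFO read of every line.

    #include <immintrin.h>
    #include <stddef.h>

    // destination assumed 32-byte aligned, n a multiple of 8
    void fill_nt(float *destination, float value, size_t n)
    {
        __m256 v = _mm256_set1_ps(value);
        for (size_t i = 0; i < n; i += 8)
            _mm256_stream_ps(destination + i, v);   // no RFO: the line is written without being read
        _mm_sfence();   // make the NT stores globally visible before other threads read the data
    }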

    But plain loads aren't wasting anything; the only downside is evicting other data as you loop over huge arrays, generating boatloads of cache misses, if you can't change your algorithm to have any locality.


    If cache-blocking is possible for your algorithm, there's much more to gain from that, so you don't just bottleneck on DRAM bandwidth. E.g. do multiple steps over a subset of your data before moving on to the next subset.
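
    A minimal cache-blocking sketch, assuming two hypothetical per-element passes (step1 and step2 are placeholders for whatever your algorithm does):

    #include <stddef.h>

    void step1(float *x, size_t n);   // hypothetical first pass over a chunk
    void step2(float *x, size_t n);   // hypothetical second pass over the same chunk

    void process_blocked(float *data, size_t n)
    {
        const size_t BLOCK = 64 * 1024;   // floats per block (256 KiB); size it to fit in L2
        for (size_t start = 0; start < n; start += BLOCK)
        {
            size_t len = (n - start < BLOCK) ? (n - start) : BLOCK;
            // Run both passes while this chunk is still hot in cache,
            // instead of sweeping the whole array from DRAM twice.
            step1(data + start, len);
            step2(data + start, len);
        }
    }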

    See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - most of it; go read Ulrich Drepper's paper.

    Anything you can do to increase computational intensity helps (ALU work per time the data is loaded into L1d cache, or into registers).

    Even better, write a custom loop that combines multiple steps you were going to do on each element. Avoid stuff like for(i) A[i] = sqrt(B[i]) if there is an earlier or later step that also does something simple to each element of the same array.
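
    A minimal sketch of that kind of fusion (the scaling step is a made-up second pass, just for illustration):

    #include <immintrin.h>
    #include <stddef.h>

    // One fused pass: sqrt and scale in the same loop, instead of
    //   for(i) A[i] = sqrt(B[i]);  followed by  for(i) A[i] *= scale;
    // A and B assumed 32-byte aligned, n a multiple of 8.
    void fused_sqrt_scale(float *A, const float *B, float scale, size_t n)
    {
        __m256 s = _mm256_set1_ps(scale);
        for (size_t i = 0; i < n; i += 8)
        {
            __m256 b = _mm256_load_ps(B + i);
            _mm256_store_ps(A + i, _mm256_mul_ps(_mm256_sqrt_ps(b), s));
        }
    }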

    If you're using NumPy or something, and just gluing together optimized building blocks that operate on large arrays, it's kind of expected that you'll bottleneck on memory bandwidth for algorithms with low computational intensity (like STREAM add or triad type of things).

    If you're using C with intrinsics, you should be aiming higher. You might still bottleneck on memory bandwidth, but your goal should be to saturate the ALUs, or at least bottleneck on L2 cache bandwidth.

    Sometimes that's hard, or you haven't yet gotten around to all the optimizations on your TODO list, so NT stores can be good for memory bandwidth if nothing is going to re-read the data any time soon. But consider that a sign of failure, not success. CPUs have large, fast caches; use them.


    Further reading: