Search code examples
intelintrinsicsxeon-phiavx512

_mm512_storenr_pd and _mm512_storenrngo_pd


What is the difference between _mm512_storenrngo_pd and _mm512_storenr_pd?

_mm512_storenr_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint to the processor.

It is not clear to me, what no-read hint means. Does it mean, that it is a non-cache coherent write. Does it mean, that a reuse is more expensive or not coherent?

_mm512_storenrngo_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint and using a weakly-ordered memory consistency model (stores performed with this function are not globally ordered, and subsequent stores from the same thread can be observed before them).

Basically the same as storenr_pd, but since it uses a weak consistency model, this means that a process can view its own writes before any other processor. But the access of another processor is non-coherent or more expensive?


Solution

  • Quote from Intel® Xeon Phi™ Coprocessor Vector Microarchitecture:

    In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.

    The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data―that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.

    It seems that "No-read hints", "Streaming store", and "Non-temporal Stream/Store" are used interchangeably in several resources.

    So yes it is non-cache coherent write, though with Knights Corner (KNC, where both vmovnrap* and vmovnrngoap* belong) the stores happen to L2 cache, it does not bypass all levels of cache.

    As explained in above quote, vmovnrngoap* is special from vmovnrap* that weakly-ordered memory consistency model allows "subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed", so yes the access of another thread or processor is non-coherent, and a fencing operation should be used. Though CPUID can be used as the fencing operation, better options are "LOCK ADD [RSP],0" (a dummy atomic add) or XCHG (which combines a store and a fence).

    A few more details:

    NR Stores.The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cacheline. There is no data transfer from main memory which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.

    The NR.NGO (non-globally ordered) store instruction(vmovnrngo) relaxes the global ordering constraint of the NR store instruction.This relaxation makes the NR.NGO instruction have a lower latency than the NRinstruction, which can be used to achieve higher performance in streaming storeintensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering the order of these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.

    A general rule of thumb is that non-temporal store benefit memory access blocks that are not reused in the immediate future. So that yes reuse will be expensive in both cases.