Are there C/C++ intrinsics for non-temporal loads (i.e. loads without caching, directly from DRAM) of 32- and 64-bit values on x86_64?
My compiler is MSVC++2017 toolset v141. But intrinsics for other compilers are welcome, as well as references to the underlying assembly instructions.
At the time of writing (August 2017) there are no non-temporal loads to GP registers.
The only available non-temporal instructions are:
Integer domain
(v)movntdqa (load): despite the name, this instruction loads 128/256/512 bits, aligned on their natural boundary, into xmm/ymm/zmm registers respectively.
(v)movntdq (store): despite the name, this instruction stores xmm/ymm/zmm registers into a 128/256/512-bit memory location aligned on its natural boundary.
GP registers
movnti (store): stores a 32/64-bit GP register into a DWORD/QWORD in memory.
MMX registers
movntq (store): stores an MMX register into a QWORD in memory.
Floating point domain
(v)movntpd/s (store, legacy and VEX encoded): stores an xmm/ymm/zmm register into an aligned 128/256/512-bit memory location. Like movntdq, but in the FP domain.
(v)movntpd/s (store, EVEX encoded): stores an xmm/ymm/zmm register into an aligned 512-bit memory location, clearing the upper unused bits. Like movntdq, but in the FP domain.
The Intel manuals are contradictory on this.
Masked movs
(v)maskmovdqu (store): stores the bytes of an xmm register according to the mask in another xmm register.
(v)maskmovq (store): stores the bytes of an MMX register according to the mask in another MMX register.
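As for intrinsics: movnti is exposed as _mm_stream_si32 / _mm_stream_si64, and movntdqa as _mm_stream_load_si128 (SSE4.1). Since there is no non-temporal load into a GP register, the closest thing is to load a whole 16-byte-aligned block with movntdqa and then move the part you need into a GP register. Below is a minimal sketch of that idea. The helper names (nt_store_32, nt_load_low_64, and so on) and the NT_TARGET_SSE41 macro are made up for illustration and are not part of any library. Keep in mind that, as far as I know, the non-temporal hint of movntdqa only has an effect on write-combining (WC) memory; on ordinary write-back memory it will most likely behave like a regular load.

```cpp
#include <immintrin.h>  // SSE2 / SSE4.1 intrinsics (GCC, Clang, MSVC)

// GCC and Clang need SSE4.1 enabled per function (or -msse4.1); MSVC does not.
// NT_TARGET_SSE41 is a hypothetical helper macro, not a standard one.
#if defined(__GNUC__) || defined(__clang__)
#define NT_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define NT_TARGET_SSE41
#endif

// movnti: non-temporal store of a 32-bit GP register.
inline void nt_store_32(int* dst, int value) {
    _mm_stream_si32(dst, value);
}

// movnti: non-temporal store of a 64-bit GP register (x86_64 only).
inline void nt_store_64(long long* dst, long long value) {
    _mm_stream_si64(dst, value);
}

// movntdqa: load a whole 16-byte-aligned block, then move its low
// 64 bits into a GP register.
NT_TARGET_SSE41 inline long long nt_load_low_64(__m128i* src) {
    __m128i block = _mm_stream_load_si128(src);
    return _mm_cvtsi128_si64(block);
}

// Same idea for the low 32 bits.
NT_TARGET_SSE41 inline int nt_load_low_32(__m128i* src) {
    __m128i block = _mm_stream_load_si128(src);
    return _mm_cvtsi128_si32(block);
}
```

The source pointer passed to the load helpers must be 16-byte aligned, otherwise movntdqa faults. Non-temporal stores are weakly ordered, so issue _mm_sfence() after them if other threads or devices must observe the data in order.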