Tags: c++, c, assembly, intrinsics, cpu-cache

C/C++ intrinsics for non-temporal loads of 32- and 64-bit values on x86_64?


Are there C/C++ intrinsics for non-temporal loads (i.e. loads without caching, directly from DRAM) of 32- and 64-bit values on x86_64?

My compiler is MSVC++ 2017 (toolset v141), but intrinsics for other compilers are welcome, as are references to the underlying assembly instructions.


Solution

  • At the time of writing (August 2017) there are no non-temporal loads to GP registers.


    The only available non-temporal instructions are:

    Integer domain

    (v)movntdqa (load): despite the name, this instruction loads 128/256/512 bits, aligned on their natural boundary, into an xmm/ymm/zmm register respectively.
    (v)movntdq (store): despite the name, this instruction stores an xmm/ymm/zmm register to a 128/256/512-bit memory location aligned on its natural boundary.
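In intrinsic form, the 128-bit variants of these two instructions are exposed as `_mm_stream_load_si128` (SSE4.1) and `_mm_stream_si128` (SSE2). The following is a minimal sketch, not a tuned copy routine: the helper name `copy_blocks_nt` and the `NT_TARGET_SSE41` macro are made up for illustration, and both pointers are assumed to be 16-byte aligned. Note that Intel documents the load hint as something an implementation may honor for WC (write-combining) memory, so on ordinary write-back memory it may behave like a regular load.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_stream_si128 (and _mm_sfence via xmmintrin.h)
#include <smmintrin.h>  // SSE4.1: _mm_stream_load_si128

// GCC and Clang need SSE4.1 enabled for the function that calls the load
// intrinsic; MSVC does not. This macro is an illustrative assumption.
#if defined(__GNUC__) || defined(__clang__)
#define NT_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define NT_TARGET_SSE41
#endif

// Hypothetical helper: copies `count` 16-byte blocks from `src` to `dst`.
// Both pointers must be 16-byte aligned.
NT_TARGET_SSE41
void copy_blocks_nt(__m128i* dst, __m128i* src, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        __m128i v = _mm_stream_load_si128(src + i);  // movntdqa
        _mm_stream_si128(dst + i, v);                // movntdq
    }
    _mm_sfence();  // order the weakly-ordered NT stores before later stores
}
```

The trailing `_mm_sfence()` matters when another thread will read the destination, because non-temporal stores are weakly ordered with respect to other stores.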

    GP registers

    movnti (store): stores a 32/64-bit GP register into a DWORD/QWORD in memory.
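For the 32- and 64-bit case in the question, movnti is the only option, and it covers stores only. A minimal sketch of the matching intrinsics follows. The helper names `store_nt_32` and `store_nt_64` are hypothetical. The 64-bit form is available only when targeting x86_64, and to my knowledge MSVC spells it `_mm_stream_si64x` while GCC and Clang use `_mm_stream_si64`; treat the header choice on MSVC as an assumption.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si32, _mm_stream_si64, _mm_sfence

// Hypothetical helper: 32-bit non-temporal store (movnti m32, r32).
inline void store_nt_32(int* p, int value) {
    _mm_stream_si32(p, value);
}

// Hypothetical helper: 64-bit non-temporal store (movnti m64, r64), x86_64 only.
inline void store_nt_64(long long* p, long long value) {
#if defined(_MSC_VER) && !defined(__clang__)
    _mm_stream_si64x(p, value);  // MSVC spelling (assumed from its intrinsics list)
#else
    _mm_stream_si64(p, value);   // GCC / Clang spelling
#endif
}
```

There is no load counterpart to these helpers: reading the value back is an ordinary load. Call `_mm_sfence()` after a batch of such stores if other threads must observe them in order.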

    MMX registers

    movntq (store): stores an MMX register into a QWORD in memory.

    Floating point domain

    (v)movntpd/s (store) (legacy and VEX encoded): stores an xmm/ymm register into an aligned 128/256-bit memory location (zmm registers need the EVEX encoding). Like movntdq but in the FP domain.

    (v)movntpd/s (store) (EVEX encoded): stores an xmm/ymm/zmm register into an aligned 512-bit memory location, clearing the upper unused bits. Like movntdq but in the FP domain.
    The Intel manuals are contradictory on this point.
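The 128-bit floating-point stores map to `_mm_stream_ps` (movntps) and `_mm_stream_pd` (movntpd). As a hedged illustration, the sketch below fills a float buffer with movntps. The helper name `fill_nt` is made up, and it assumes `dst` is 16-byte aligned and `count` is a multiple of 4.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper: writes `value` to `count` floats using non-temporal stores.
// Assumes `dst` is 16-byte aligned and `count` is a multiple of 4.
void fill_nt(float* dst, std::size_t count, float value) {
    const __m128 v = _mm_set1_ps(value);  // broadcast value to all four lanes
    for (std::size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);        // movntps
    }
    _mm_sfence();  // make the NT stores ordered with respect to later stores
}
```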

    Masked movs

    (v)maskmovdqu (store): stores the bytes of an xmm register according to the mask in another xmm register.

    maskmovq (store): stores the bytes of an MMX register according to the mask in another MMX register. It has no VEX-encoded form.
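The xmm masked store is reachable through `_mm_maskmoveu_si128` (SSE2), which writes only the bytes whose mask byte has its most significant bit set and does not require an aligned destination. A small sketch under those assumptions, with the hypothetical helper name `store_low_half_nt`:

```cpp
#include <emmintrin.h>  // SSE2: _mm_maskmoveu_si128, _mm_set_epi32, _mm_sfence

// Hypothetical helper: non-temporally writes only the low 8 bytes of `data`
// to `dst`; the 8 bytes at dst[8..15] are left unmodified.
void store_low_half_nt(char* dst, __m128i data) {
    // Mask bytes with the high bit set select which bytes are written.
    // _mm_set_epi32 takes the highest element first, so the two -1 values
    // cover the low 64 bits.
    const __m128i mask = _mm_set_epi32(0, 0, -1, -1);
    _mm_maskmoveu_si128(data, mask, dst);  // maskmovdqu
    _mm_sfence();
}
```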